Abstract

This paper presents new and effective algorithms for learning kernels. In particular, as shown by 
our empirical results, these algorithms consistently outperform the so-called uniform combination 
solution that has proven to be difficult to improve upon in the past, as well as other algorithms for 
learning kernels based on convex combinations of base kernels in both classification and regression. 
Our algorithms are based on the notion of centered alignment which is used as a similarity measure 
between kernels or kernel matrices. We present a number of novel algorithmic, theoretical, and 
empirical results for learning kernels based on our notion of centered alignment. In particular, 
we describe efficient algorithms for learning a maximum alignment kernel by showing that the 
problem can be reduced to a simple QP and discuss a one-stage algorithm for learning both a kernel 
and a hypothesis based on that kernel using an alignment-based regularization. Our theoretical 
results include a novel concentration bound for centered alignment between kernel matrices, the 
proof of the existence of effective predictors for kernels with high alignment, both for classification 
and for regression, and the proof of stability-based generalization bounds for a broad family of 
algorithms for learning kernels based on centered alignment. We also report the results of extensive 
experiments with our centered alignment-based algorithms in both classification and regression. 

Keywords: Kernel methods, learning kernels, feature selection. 



1. Introduction 

One of the key steps in the design of learning algorithms is the choice of the features. This choice 
is typically left to the user and represents his prior knowledge about the task. Clearly, with a poor 
choice, random features in the extreme case, no learning algorithm could be successful. With a good 
choice, one including the target label in the optimal case, high accuracy can be achieved. 

For kernel-based algorithms, which are widely used in machine learning, the features are provided
intrinsically in a high-dimensional space via the choice of a positive-definite symmetric kernel
function (Boser et al., 1992; Cortes and Vapnik, 1995; Vapnik, 1998). To limit the risk of a poor
choice of kernel, in the last decade or so, a number of publications have investigated the idea of
learning the kernel from data (Cristianini et al., 2001; Chapelle et al., 2002; Bousquet and Herrmann,
2002; Lanckriet et al., 2004; Jebara, 2004; Argyriou et al., 2005; Micchelli and Pontil, 2005;
Lewis et al., 2006; Argyriou et al., 2006; Cortes et al., 2008; Sonnenburg et al., 2006; Srebro and
Ben-David, 2006; Zien and Ong, 2007; Cortes et al., 2009a, 2010a,b). This reduces the requirement
from the user to only specifying a family of kernels rather than a specific kernel. The task of selecting
(or learning) a kernel out of that family is then reserved to the learning algorithm which, as
for standard kernel-based methods, must also use the data to choose a hypothesis in the reproducing
kernel Hilbert space (RKHS) associated to the kernel selected.

©2011 Corinna Cortes, Mehryar Mohri and Afshin Rostamizadeh.

Different kernel families have been studied in the past, but the most widely used one has been 
that of convex combinations of a finite set of base kernels. However, while different learning kernel 
algorithms have been introduced in that case, including those of Lanckriet et al. (2004), to our 
knowledge, in the past, none has succeeded in consistently and significantly outperforming the 
uniform combination solution, in classification or regression tasks. The uniform solution consists 
of simply learning a hypothesis out of the RKHS associated to a uniform combination of the base 
kernels. This disappointing performance of learning kernel algorithms has been pointed out in 
different instances, including by many participants at different NIPS workshops organized on the 
theme in 2008 and 2009, as well as in a survey talk (Cortes, 2009). The empirical results we 
report further confirm that. Other kernel families have been considered in the literature, including 
hyperkernels (Ong et al., 2005), Gaussian kernel families (?), or non-linear families (Bach, 2008; 
Cortes et al., 2009b; Varma and Babu, 2009). However, the performance reported for these other 
families does not seem to be consistently superior to that of the uniform combination either. 

In contrast, on the theoretical side, favorable guarantees have been derived for learning kernels. 
For general kernel families, learning bounds based on covering numbers were given by Srebro and 
Ben-David (2006). Margin-based generalization bounds based on an analysis of the Rademacher 
complexity with only a logarithmic dependency on the number of base kernels were given by Cortes 
et al. (2010b) for convex combinations of kernels with an $L_1$ constraint, as well as other optimal
bounds for other $L_q$ constraints. These learning guarantees suggest that learning kernel algorithms
even with a relatively large number of base kernels could achieve a good performance. 

This paper presents new algorithms for learning kernels whose performance is more consis- 
tent with expectations based on these theoretical guarantees. In particular, as can be seen by our 
experimental results, several of the algorithms we describe consistently outperform the uniform 
combination solution. They also surpass in performance the algorithm of Lanckriet et al. (2004) in 
classification and improve upon that of Cortes et al. (2009a) in regression. Thus, this can be viewed 
as the first series of algorithmic solutions for learning kernels in classification and regression with 
consistent performance improvements. 

Our learning kernel algorithms are based on the notion of centered alignment which is a similarity
measure between kernels or kernel matrices. This can be used to measure the similarity of
each base kernel with the target kernel $K_Y$ derived from the output labels. Our definition of centered
alignment is close to the uncentered kernel alignment originally introduced by Cristianini et al.
(2001). This closeness is only superficial, however: as we shall see both from the analysis of several
cases and from experimental results, in contrast with our notion of alignment, the uncentered kernel
alignment of Cristianini et al. (2001) does not correlate well with performance and thus, in general,
cannot be used effectively for learning kernels.

We present a number of novel algorithmic, theoretical, and empirical results for learning kernels
based on our notion of centered alignment. In Section 2, we introduce and analyze the properties 
of centered alignment between kernel functions and kernel matrices, and discuss its benefits. In 
particular, the importance of the centering is justified theoretically and validated empirically. We 
then describe several algorithms based on the notion of centered alignment in Section 3. 

We present two algorithms that work in two subsequent stages (Sections 3.1 and 3.2): the first 
stage consists of learning a kernel K that is a non-negative linear combination of p base kernels; 
the second stage combines this kernel with a standard kernel-based learning algorithm such as sup- 
port vector machines (SVMs) (Cortes and Vapnik, 1995) for classification, or KRR for regression 
(Saunders et al., 1998), to select a prediction hypothesis. These two algorithms differ in the way 
centered alignment is used to learn K. The simpler algorithm, which is straightforward to implement,
selects the weight of each base kernel matrix independently, based only on the centered alignment of
that matrix with the target kernel matrix. The more accurate algorithm instead determines these weights
jointly by measuring the centered alignment of a convex combination of base kernel matrices with
the target one. We show that this more accurate algorithm is very efficient by proving that the base
kernel weights can be obtained by solving a simple quadratic program (QP). We also give a closed-form
expression for the weights in the case of a linear combination that is not necessarily convex. Note
that an alternative two-stage technique consists of first learning a prediction hypothesis using each 
base kernel and then learning the best linear combination of these hypotheses. But, as pointed out 
in Section 3.3, in general, such ensemble-based techniques make use of a richer hypothesis space 
than the one used by learning kernel algorithms. We also present and analyze an algorithm that 
uses centered alignment to both select a convex combination kernel and a hypothesis based on that 
kernel, these two tasks being performed in a single stage by solving a single optimization problem 
(Section 3.4). 

We also present an extensive theoretical analysis of the notion of centered alignment and algorithms
based on that notion. We prove a concentration bound for the notion of centered alignment
showing that the centered alignment of two kernel matrices is sharply concentrated around the centered
alignment of the corresponding kernel functions, the difference being bounded by a term in
$O(1/\sqrt{m})$ for samples of size $m$ (Section 4.1). Our result is simpler and directly bounds the
difference between these two relevant quantities, unlike previous work by Cristianini et al. (2001)
(for uncentered alignments). We also show the existence of good predictors for kernels with high
centered alignment, both for classification and for regression (Section 4.2). This result justifies
the search for good learning kernel algorithms based on the notion of centered alignment. We
note that the proofs given for similar results in classification for uncentered alignments are erroneous
(Cristianini et al., 2001, 2002). We also present stability-based generalization bounds for
two-stage learning kernel algorithms based on centered alignment when the second stage is kernel 
ridge regression (Section 4.3). We further study the application of these bounds in the case of our 
alignment maximization algorithm and initiate a detailed analysis of the stability of this algorithm 
(Appendix B). 

Finally, in Section 5, we report the results of extensive experiments with our centered alignment-based
algorithms both in classification and regression, and compare our results with $L_1$- and $L_2$-regularized
learning kernel algorithms (Lanckriet et al., 2004; Cortes et al., 2009a), as well as with
the uniform kernel combination method. The results show an improvement both over the uniform
combination and over the one-stage kernel learning algorithms. They also demonstrate a strong
correlation between the centered alignment achieved and the performance of the algorithm.¹

2. Alignment definitions 

The notion of kernel alignment was first introduced by Cristianini et al. (2001). Our definition of 
kernel alignment is different and is based on the notion of centering in the feature space. Thus, we 
start with the definition of centering and the analysis of its relevant properties. 

2.1 Centered kernel functions 

Let $D$ be the distribution according to which training and test points are drawn. Centering a feature
mapping $\Phi\colon X \to H$ consists of replacing it by $\Phi - \mathrm{E}_x[\Phi]$, where $\mathrm{E}_x$ denotes the expected
value of $\Phi$ when $x$ is drawn according to the distribution $D$. Centering a positive definite symmetric
(PDS) kernel function $K\colon X \times X \to \mathbb{R}$ consists of centering any feature mapping $\Phi$ associated
to $K$. Thus, the centered kernel $K_c$ associated to $K$ is defined for all $x, x' \in X$ by

\[
\begin{aligned}
K_c(x, x') &= (\Phi(x) - \mathrm{E}_x[\Phi])^\top (\Phi(x') - \mathrm{E}_{x'}[\Phi]) \\
&= K(x, x') - \mathrm{E}_x[K(x, x')] - \mathrm{E}_{x'}[K(x, x')] + \mathrm{E}_{x, x'}[K(x, x')].
\end{aligned}
\]

This also shows that the definition does not depend on the choice of the feature mapping associated
to $K$. Since $K_c(x, x')$ is defined as an inner product, $K_c$ is also a PDS kernel. Note also that for a
centered kernel $K_c$, $\mathrm{E}_{x, x'}[K_c(x, x')] = 0$. That is, centering the feature mapping implies centering
the kernel function.

2.2 Centered kernel matrices 

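Similar definitions can be given for a finite sample $S = (x_1, \ldots, x_m)$ drawn according to $D$: a
feature vector $\Phi(x_i)$ with $i \in [1, m]$ is then centered by replacing it with $\Phi(x_i) - \bar{\Phi}$, with
$\bar{\Phi} = \frac{1}{m} \sum_{i=1}^m \Phi(x_i)$, and the kernel matrix $\mathbf{K}$ associated to $K$ and the sample $S$ is centered
by replacing it with $\mathbf{K}_c$ defined for all $i, j \in [1, m]$ by

\[
[\mathbf{K}_c]_{ij} = \mathbf{K}_{ij} - \frac{1}{m} \sum_{i=1}^m \mathbf{K}_{ij} - \frac{1}{m} \sum_{j=1}^m \mathbf{K}_{ij} + \frac{1}{m^2} \sum_{i,j=1}^m \mathbf{K}_{ij}. \tag{1}
\]

Let $\boldsymbol{\Phi} = [\Phi(x_1), \ldots, \Phi(x_m)]^\top$ and $\bar{\boldsymbol{\Phi}} = [\bar{\Phi}, \ldots, \bar{\Phi}]^\top$. Then, it is not hard to verify that
$\mathbf{K}_c = (\boldsymbol{\Phi} - \bar{\boldsymbol{\Phi}})(\boldsymbol{\Phi} - \bar{\boldsymbol{\Phi}})^\top$, which shows that $\mathbf{K}_c$ is a positive semi-definite (PSD) matrix. Also, as
with the kernel function, $\frac{1}{m^2} \sum_{i,j=1}^m [\mathbf{K}_c]_{ij} = 0$. Let $\langle \cdot, \cdot \rangle_F$ denote the Frobenius product and
$\| \cdot \|_F$ the Frobenius norm:

\[
\forall \mathbf{A}, \mathbf{B} \in \mathbb{R}^{m \times m}, \quad \langle \mathbf{A}, \mathbf{B} \rangle_F = \operatorname{Tr}[\mathbf{A}^\top \mathbf{B}] \quad \text{and} \quad \|\mathbf{A}\|_F = \sqrt{\langle \mathbf{A}, \mathbf{A} \rangle_F}.
\]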

Then, the following basic properties hold for centering kernel matrices. 



1. An earlier version of this work was presented by Cortes et al. (2010a). This extended version includes
additional material, in particular additional empirical evidence supporting the importance of centered alignment, the
description and discussion of a single-stage algorithm for learning kernels based on centered alignment, an analysis 
of unnormalized centered alignment and the corresponding proof of good predictors, generalization bounds for two- 
stage learning kernel algorithms based on centered alignment, and experimental results for the single-stage algorithm. 
Lemma 1 Let $\mathbf{1} \in \mathbb{R}^{m \times 1}$ denote the vector with all entries equal to one, and $\mathbf{I}$ the identity matrix.

1. For any kernel matrix $\mathbf{K} \in \mathbb{R}^{m \times m}$, the centered kernel matrix $\mathbf{K}_c$ can be given by

\[
\mathbf{K}_c = \left[ \mathbf{I} - \frac{\mathbf{1}\mathbf{1}^\top}{m} \right] \mathbf{K} \left[ \mathbf{I} - \frac{\mathbf{1}\mathbf{1}^\top}{m} \right].
\]

2. For any two kernel matrices $\mathbf{K}$ and $\mathbf{K}'$,

\[
\langle \mathbf{K}_c, \mathbf{K}'_c \rangle_F = \langle \mathbf{K}, \mathbf{K}'_c \rangle_F = \langle \mathbf{K}_c, \mathbf{K}' \rangle_F.
\]

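Proof The first statement can be shown straightforwardly from the definition of $\mathbf{K}_c$, Equation (1).
The second statement follows from

\[
\langle \mathbf{K}_c, \mathbf{K}'_c \rangle_F = \operatorname{Tr}\left[ \left[ \mathbf{I} - \frac{\mathbf{1}\mathbf{1}^\top}{m} \right] \mathbf{K} \left[ \mathbf{I} - \frac{\mathbf{1}\mathbf{1}^\top}{m} \right] \left[ \mathbf{I} - \frac{\mathbf{1}\mathbf{1}^\top}{m} \right] \mathbf{K}' \left[ \mathbf{I} - \frac{\mathbf{1}\mathbf{1}^\top}{m} \right] \right],
\]

the fact that $[\mathbf{I} - \frac{1}{m}\mathbf{1}\mathbf{1}^\top]^2 = [\mathbf{I} - \frac{1}{m}\mathbf{1}\mathbf{1}^\top]$, and the trace property $\operatorname{Tr}[\mathbf{A}\mathbf{B}] = \operatorname{Tr}[\mathbf{B}\mathbf{A}]$, valid for
all matrices $\mathbf{A}, \mathbf{B} \in \mathbb{R}^{m \times m}$. ■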

We shall use these properties in the proofs of the results presented in Section 4. 
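
The two statements of Lemma 1 are easy to check numerically. The following is a minimal NumPy sketch, not part of the original experiments: the function names and the random toy kernels are assumptions made purely for illustration.

```python
import numpy as np


def center_kernel_matrix(K):
    """Return K_c = (I - 11^T/m) K (I - 11^T/m) for an m x m kernel matrix K."""
    m = K.shape[0]
    U = np.eye(m) - np.ones((m, m)) / m  # centering matrix I - 11^T/m
    return U @ K @ U


def frobenius_product(A, B):
    """Frobenius product <A, B>_F = Tr[A^T B], computed entrywise."""
    return float(np.sum(A * B))


# Hypothetical toy data: two linear kernels built from random features.
rng = np.random.default_rng(0)
F1 = rng.normal(size=(6, 3))
F2 = rng.normal(size=(6, 2))
K, K_prime = F1 @ F1.T, F2 @ F2.T
Kc, Kc_prime = center_kernel_matrix(K), center_kernel_matrix(K_prime)

# Entrywise form of Equation (1), used here only as a cross-check.
Kc_entrywise = (K - K.mean(axis=0, keepdims=True)
                - K.mean(axis=1, keepdims=True) + K.mean())
```

Comparing `Kc` with `Kc_entrywise` checks the first statement, and comparing `frobenius_product(Kc, Kc_prime)` with `frobenius_product(K, Kc_prime)` checks the second.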

2.3 Centered kernel alignment 

In the following sections, in the absence of ambiguity, to abbreviate the notation, we often omit the 
variables over which an expectation is taken. We define the alignment of two kernel functions as 
follows. 

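Definition 2 (Kernel function alignment) Let $K$ and $K'$ be two kernel functions defined over $X \times X$
such that $0 < \mathrm{E}[K_c^2] < +\infty$ and $0 < \mathrm{E}[K_c'^2] < +\infty$. Then, the alignment between $K$ and $K'$ is
defined by

\[
\rho(K, K') = \frac{\mathrm{E}[K_c K'_c]}{\sqrt{\mathrm{E}[K_c^2] \, \mathrm{E}[K_c'^2]}}.
\]

Since $|\mathrm{E}[K_c K'_c]| \leq \sqrt{\mathrm{E}[K_c^2] \, \mathrm{E}[K_c'^2]}$ by the Cauchy-Schwarz inequality, we have $\rho(K, K') \in [-1, 1]$.
The following lemma shows more precisely that $\rho(K, K') \in [0, 1]$ when $K$ and $K'$ are
PDS kernels.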

Lemma 3 For any two PDS kernels $K$ and $K'$, $\mathrm{E}[KK'] \geq 0$.

Proof Let $\Phi$ be a feature mapping associated to $K$ and $\Phi'$ a feature mapping associated to $K'$. By
definition of $\Phi$ and $\Phi'$, and using the properties of the trace, we can write:

\[
\begin{aligned}
\mathrm{E}_{x, x'}[K(x, x') K'(x, x')] &= \mathrm{E}_{x, x'}\big[\Phi(x)^\top \Phi(x') \Phi'(x')^\top \Phi'(x)\big] \\
&= \mathrm{E}_{x, x'}\big[\operatorname{Tr}[\Phi(x)^\top \Phi(x') \Phi'(x')^\top \Phi'(x)]\big] \\
&= \big\langle \mathrm{E}_x[\Phi(x) \Phi'(x)^\top], \mathrm{E}_{x'}[\Phi(x') \Phi'(x')^\top] \big\rangle_F = \|\mathbf{U}\|_F^2 \geq 0,
\end{aligned}
\]

where $\mathbf{U} = \mathrm{E}_x[\Phi(x) \Phi'(x)^\top]$. ■

Figure 1: (a) Representation of the distribution $D$. In this simple two-dimensional example, a fraction
$\alpha$ of the points are at $(-1, 0)$ and have the label $-1$. The remaining points are at
$(1, 0)$ and have the label $+1$. (b) Alignment values computed for two different definitions
of alignment. The alignment computed according to Cristianini et al. (2001) gives rise to
the curved black line $A = (\alpha^2 + (1 - \alpha)^2)^{1/2}$, while our definition of centered
alignment results in the straight blue line $\rho = 1$.

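The lemma applies in particular to any two centered kernels $K_c$ and $K'_c$ which, as previously shown,
are PDS kernels if $K$ and $K'$ are PDS. Thus, for any two PDS kernels $K$ and $K'$,

\[
\mathrm{E}[K_c K'_c] \geq 0.
\]

The following similarly defines the alignment between two kernel matrices $\mathbf{K}$ and $\mathbf{K}'$ based on a
finite sample $S = (x_1, \ldots, x_m)$ drawn according to $D$.

Definition 4 (Kernel matrix alignment) Let $\mathbf{K} \in \mathbb{R}^{m \times m}$ and $\mathbf{K}' \in \mathbb{R}^{m \times m}$ be two kernel matrices
such that $\|\mathbf{K}_c\|_F \neq 0$ and $\|\mathbf{K}'_c\|_F \neq 0$. Then, the alignment between $\mathbf{K}$ and $\mathbf{K}'$ is defined by

\[
\widehat{\rho}(\mathbf{K}, \mathbf{K}') = \frac{\langle \mathbf{K}_c, \mathbf{K}'_c \rangle_F}{\|\mathbf{K}_c\|_F \, \|\mathbf{K}'_c\|_F}.
\]

Here too, by the Cauchy-Schwarz inequality, $\widehat{\rho}(\mathbf{K}, \mathbf{K}') \in [-1, 1]$ and in fact $\widehat{\rho}(\mathbf{K}, \mathbf{K}') \geq 0$ since the
Frobenius product of any two positive semi-definite matrices $\mathbf{K}$ and $\mathbf{K}'$ is non-negative. Indeed, for
such matrices, there exist matrices $\mathbf{U}$ and $\mathbf{V}$ such that $\mathbf{K} = \mathbf{U}\mathbf{U}^\top$ and $\mathbf{K}' = \mathbf{V}\mathbf{V}^\top$. The statement
follows from

\[
\langle \mathbf{K}, \mathbf{K}' \rangle_F = \operatorname{Tr}(\mathbf{U}\mathbf{U}^\top \mathbf{V}\mathbf{V}^\top) = \operatorname{Tr}\big((\mathbf{U}^\top \mathbf{V})^\top (\mathbf{U}^\top \mathbf{V})\big) = \|\mathbf{U}^\top \mathbf{V}\|_F^2 \geq 0. \tag{2}
\]

This applies in particular to the kernel matrices of the PDS kernels $K_c$ and $K'_c$:

\[
\langle \mathbf{K}_c, \mathbf{K}'_c \rangle_F \geq 0.
\]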
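
As an illustration of Definition 4, the centered alignment of two kernel matrices can be computed in a few lines. This is a sketch under stated assumptions: the function name `centered_alignment`, the random features, and the choice of labels are hypothetical and serve only as an example.

```python
import numpy as np


def centered_alignment(K1, K2):
    """Centered alignment of two m x m kernel matrices (Definition 4)."""
    m = K1.shape[0]
    U = np.eye(m) - np.ones((m, m)) / m  # centering matrix of Lemma 1
    K1c, K2c = U @ K1 @ U, U @ K2 @ U
    # np.linalg.norm of a 2-D array defaults to the Frobenius norm.
    denom = np.linalg.norm(K1c) * np.linalg.norm(K2c)
    if denom == 0.0:
        raise ValueError("alignment is undefined when a centered matrix is zero")
    return float(np.sum(K1c * K2c) / denom)


# Hypothetical example: a linear kernel against the label kernel yy^T.
rng = np.random.default_rng(1)
X = rng.normal(size=(20, 4))
y = np.sign(X[:, 0])
K_lin = X @ X.T
K_Y = np.outer(y, y)
rho_hat = centered_alignment(K_lin, K_Y)
```

For PSD inputs the value lies in [0, 1], and it is unchanged when either matrix is multiplied by a positive constant.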

Our definitions of alignment between kernel functions or between kernel matrices differ from
those originally given by Cristianini et al. (2001, 2002):

\[
A = \frac{\mathrm{E}[KK']}{\sqrt{\mathrm{E}[K^2] \, \mathrm{E}[K'^2]}}, \qquad \widehat{A} = \frac{\langle \mathbf{K}, \mathbf{K}' \rangle_F}{\|\mathbf{K}\|_F \, \|\mathbf{K}'\|_F},
\]

which are thus in terms of $K$ and $K'$ instead of $K_c$ and $K'_c$ and similarly for matrices. This may
appear to be a technicality, but it is in fact a critical difference. Without that centering, the definition
of alignment does not correlate well with performance. To see this, consider the standard case where
$K'$ is the target label kernel, that is $K'(x, x') = yy'$, with $y$ the label of $x$ and $y'$ the label of $x'$, and
examine the following simple example in dimension two ($X = \mathbb{R}^2$), where $K(x, x') = x \cdot x' + 1$
and where the distribution, $D$, is defined by a fraction $\alpha \in [0, 1]$ of all points being at $(-1, 0)$ and
labeled with $-1$, and the remaining points at $(1, 0)$ with label $+1$ as shown in Figure 1.

                   KINEMATICS   IONOSPHERE   GERMAN     SPAMBASE   SPLICE
                   (REGR.)      (REGR.)      (CLASS.)   (CLASS.)   (CLASS.)
$\widehat{\rho}$   0.9624       0.9979       0.9439     0.9918      0.9515
$\widehat{A}$      0.8627       0.9841       0.9390     0.9889     -0.4484

Table 1: The correlations of the alignment values and error-rates of various kernels. The top row
reports the correlation of the accuracy of the base kernels used in Section 5 with the centered
alignments $\widehat{\rho}$, the bottom row the correlation with the non-centered alignment $\widehat{A}$.

Clearly, for any value of $\alpha \in [0, 1]$, the problem is separable, for example with the simple vertical
line going through the origin, and one would expect the alignment to be 1. However, the alignment
$A$ can easily be calculated from the structure of the distribution $D$:

\[
\begin{aligned}
\mathrm{E}[K'^2] &= 1, \\
\mathrm{E}[K^2] &= \alpha^2 \cdot 4 + (1 - \alpha)^2 \cdot 4 + 2\alpha(1 - \alpha) \cdot 0 = 4(\alpha^2 + (1 - \alpha)^2), \\
\mathrm{E}[KK'] &= \alpha^2 \cdot 2 + (1 - \alpha)^2 \cdot 2 + 2\alpha(1 - \alpha) \cdot 0 = 2(\alpha^2 + (1 - \alpha)^2),
\end{aligned}
\]

which finally gives $A = (\alpha^2 + (1 - \alpha)^2)^{1/2}$. Thus, $A$ is never equal to one except for $\alpha = 0$ or
$\alpha = 1$ and for the balanced case, where $\alpha = 1/2$, its value is $A = 1/\sqrt{2} \approx 0.707 < 1$. In contrast,
with our definition, $\rho(K, K') = 1$ for all $\alpha \in [0, 1]$ (see Figure 1).
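
The calculation above can be reproduced numerically. The sketch below evaluates both alignments exactly on the two-point distribution; the function name `toy_alignments` is an assumption, and the centering is taken with respect to the distribution weights $(\alpha, 1 - \alpha)$ rather than a uniform sample. It is meant for $0 < \alpha < 1$, since the centered kernels vanish at the endpoints.

```python
import numpy as np


def toy_alignments(alpha):
    """Return (A, rho) for the two-point distribution of Figure 1, 0 < alpha < 1."""
    p = np.array([alpha, 1.0 - alpha])        # probabilities of the two points
    x = np.array([[-1.0, 0.0], [1.0, 0.0]])   # support points
    y = np.array([-1.0, 1.0])                 # their labels
    K = x @ x.T + 1.0                         # K(x, x') = x . x' + 1
    K_Y = np.outer(y, y)                      # K'(x, x') = y y'
    W = np.outer(p, p)                        # joint weights of the pair (x, x')

    def expect(F):
        return float(np.sum(W * F))

    # Uncentered alignment A of Cristianini et al. (2001).
    A = expect(K * K_Y) / np.sqrt(expect(K ** 2) * expect(K_Y ** 2))

    # Centering with respect to D: K_c = (I - 1 p^T) K (I - 1 p^T)^T.
    C = np.eye(2) - np.outer(np.ones(2), p)
    Kc, K_Yc = C @ K @ C.T, C @ K_Y @ C.T
    rho = expect(Kc * K_Yc) / np.sqrt(expect(Kc ** 2) * expect(K_Yc ** 2))
    return A, rho


A_balanced, rho_balanced = toy_alignments(0.5)
```

In the balanced case the sketch returns the value $1/\sqrt{2}$ for $A$ and 1 for $\rho$, matching the closed-form expressions derived above.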

This mismatch between $A$ (or $\widehat{A}$) and the performance values can also be seen in real world
data sets. Instances of this problem have been noticed by Meila (2003) and Pothin and Richard
(2008) who have suggested various (input) data translation methods, and by Cristianini et al. (2002)
who observed an issue for unbalanced data sets. Table 1, as well as Figure 2, give a series of
empirical results in several classification and regression tasks based on data sets taken from the UCI
Machine Learning Repository (http://archive.ics.uci.edu/ml/) and Delve data sets
(http://www.cs.toronto.edu/~delve/data/datasets.html).

The table and the figure illustrate the fact that the quantity $\widehat{A}$ measured with respect to several
different kernels does not always correlate well with the performance achieved by each kernel. In
fact, for the splice classification task, the non-centered alignment is negatively correlated with the
accuracy, while a large positive correlation is expected of a good quality measure. The centered
notion of alignment, $\widehat{\rho}$, however, shows good correlation across all data sets and is always better
correlated than $\widehat{A}$.

The notion of alignment seeks to capture the correlation between the random variables $K(x, x')$
and $K'(x, x')$ and one could think it natural, as for the standard correlation coefficients, to consider
the following definition:

\[
\rho'(K, K') = \frac{\mathrm{E}[(K - \mathrm{E}[K])(K' - \mathrm{E}[K'])]}{\sqrt{\mathrm{E}[(K - \mathrm{E}[K])^2] \, \mathrm{E}[(K' - \mathrm{E}[K'])^2]}}.
\]

However, centering the kernel values, as opposed to centering the feature values, is not directly
relevant to linear predictions in feature space, while our definition of alignment, $\rho$, is precisely
related to that. Also, as already shown in Section 2.1, centering in the feature space implies the
centering of the kernel values, since $\mathrm{E}[K_c] = 0$ and $\frac{1}{m^2} \sum_{i,j=1}^m [\mathbf{K}_c]_{ij} = 0$ for any kernel $K$ and
kernel matrix $\mathbf{K}$. Conversely, however, centering of the kernel does not imply centering in feature
space.

Figure 2: Detailed view of the splice and kinematics experiments presented in Table 1. Both
the centered (plots in blue) and non-centered alignment (plots in orange) are plotted as
a function of the accuracy (for the regression problem in the kinematics task "accuracy"
is 1 - RMSE). It is apparent from these plots that the non-centered alignment can be
misleading when evaluating the quality of a kernel.



3. Algorithms 

This section discusses several learning kernel algorithms based on the notion of centered alignment.
In all cases, the family of kernels considered is that of non-negative combinations of $p$ base kernels
$K_k$, $k \in [1, p]$. Thus, the final hypothesis learned belongs to the reproducing kernel Hilbert space
(RKHS) $\mathbb{H}_{K_\mu}$ associated to a kernel of the form $K_\mu = \sum_{k=1}^p \mu_k K_k$ with $\boldsymbol{\mu} \geq 0$, which guarantees
that $K_\mu$ is PDS, and $\|\boldsymbol{\mu}\| = \Lambda \geq 0$, for some regularization parameter $\Lambda$.

We first describe and analyze two algorithms that both work in two stages: in the first stage,
these algorithms determine the mixture weights $\boldsymbol{\mu}$. In the second stage, they train a standard
kernel-based algorithm, e.g., SVMs for classification, or KRR for regression, in combination with the
kernel matrix $\mathbf{K}_\mu$ associated to $K_\mu$, to learn a hypothesis $h \in \mathbb{H}_{K_\mu}$. Thus, these two-stage algorithms
differ only by their first stage, which determines $K_\mu$. We describe first in Section 3.1 a simple
algorithm that determines each mixture weight $\mu_k$ independently (align), then, in Section 3.2,
an algorithm that determines the weights $\mu_k$ jointly (alignf) by selecting $\boldsymbol{\mu}$ to maximize the
alignment with the target kernel. We briefly discuss in Section 3.3 the relationship of such two-stage
learning algorithms with algorithms based on ensemble techniques, which also consist of two
stages. Finally, we introduce and analyze a single-stage alignment-based algorithm which learns $\boldsymbol{\mu}$
and the hypothesis $h \in \mathbb{H}_{K_\mu}$ simultaneously in Section 3.4.

3.1 Independent alignment-based algorithm (align) 

This is a simple but efficient method which consists of using the training sample to independently
compute the alignment between each kernel matrix $\mathbf{K}_k$ and the target kernel matrix $\mathbf{K}_Y = \mathbf{y}\mathbf{y}^\top$,
based on the labels $\mathbf{y}$, and to choose each mixture weight $\mu_k$ proportional to that alignment. Thus,
the resulting kernel matrix is defined by:

\[
\mathbf{K}_\mu \propto \sum_{k=1}^p \widehat{\rho}(\mathbf{K}_k, \mathbf{K}_Y) \mathbf{K}_k = \frac{1}{\|\mathbf{K}_{Y_c}\|_F} \sum_{k=1}^p \frac{\langle \mathbf{K}_{k_c}, \mathbf{K}_Y \rangle_F}{\|\mathbf{K}_{k_c}\|_F} \mathbf{K}_k. \tag{3}
\]

When the base kernel matrices $\mathbf{K}_k$ have been normalized with respect to the Frobenius norm, the
independent alignment-based algorithm can also be viewed as the solution of a joint maximization
of the unnormalized alignment defined as follows, with a norm-2 constraint on the norm of $\boldsymbol{\mu}$.
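
The first stage of align, Equation (3), can be sketched as follows. This is an illustrative sketch rather than the implementation used in the experiments: the function names, the rank-one base kernels, and the normalization of $\boldsymbol{\mu}$ to unit norm-2 (the overall scale in (3) is arbitrary) are all assumptions.

```python
import numpy as np


def center(K):
    """Center a kernel matrix: (I - 11^T/m) K (I - 11^T/m)."""
    m = K.shape[0]
    U = np.eye(m) - np.ones((m, m)) / m
    return U @ K @ U


def align_weights(base_kernels, y):
    """Weights mu_k proportional to the centered alignment of K_k with yy^T."""
    K_Yc = center(np.outer(y, y))
    mu = []
    for K in base_kernels:
        Kc = center(K)
        mu.append(np.sum(Kc * K_Yc) / (np.linalg.norm(Kc) * np.linalg.norm(K_Yc)))
    mu = np.array(mu)
    return mu / np.linalg.norm(mu)  # assumed normalization; only the direction matters


# Hypothetical data: feature 0 determines the label, feature 1 is noise.
rng = np.random.default_rng(2)
X = rng.normal(size=(40, 2))
y = np.sign(X[:, 0])
base_kernels = [np.outer(X[:, k], X[:, k]) for k in range(2)]

mu = align_weights(base_kernels, y)
K_mu = sum(w * K for w, K in zip(mu, base_kernels))  # combined kernel matrix
```

In this toy setting the kernel built from the informative feature receives the larger weight. The matrix `K_mu` would then be passed to a standard kernel method in the second stage.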

Definition 5 (Unnormalized alignment) Let $K$ and $K'$ be two PDS kernels defined over $X \times X$
and $\mathbf{K}$ and $\mathbf{K}'$ their kernel matrices for a sample of size $m$. Then, the unnormalized alignment
$\rho_u(K, K')$ between $K$ and $K'$ and the unnormalized alignment $\widehat{\rho}_u(\mathbf{K}, \mathbf{K}')$ between $\mathbf{K}$
and $\mathbf{K}'$ are defined by

\[
\rho_u(K, K') = \mathrm{E}_{x, x'}[K_c(x, x') K'_c(x, x')] \quad \text{and} \quad \widehat{\rho}_u(\mathbf{K}, \mathbf{K}') = \frac{1}{m^2} \langle \mathbf{K}_c, \mathbf{K}'_c \rangle_F.
\]

Since they are not normalized, the alignment values $\rho_u$ and $\widehat{\rho}_u$ are no longer guaranteed to be in the
interval $[0, 1]$. However, assuming the kernel function $K$ and labels are bounded, the unnormalized
alignments between $K$ and $K_Y$ are bounded as well.

Lemma 6 Let $K$ be a PDS kernel. Assume that for all $x \in X$, $K_c(x, x) \leq R^2$ and for all output
label $y$, $|y| \leq M$. Then, the following bounds hold:

\[
0 \leq \rho_u(K, K_Y) \leq MR^2 \quad \text{and} \quad 0 \leq \widehat{\rho}_u(\mathbf{K}, \mathbf{K}_Y) \leq MR^2.
\]

Proof The lower bounds hold by Lemma 3 and Inequality (2). The upper bounds can be obtained
straightforwardly via the application of the Cauchy-Schwarz inequality:

\[
\begin{aligned}
\rho_u^2(K, K_Y) &= \mathrm{E}_{(x, y), (x', y')}[K_c(x, x') yy']^2 \leq \mathrm{E}_{x, x'}[K_c^2(x, x')] \, \mathrm{E}_{y, y'}[yy']^2 \leq R^4 M^2, \\
\widehat{\rho}_u(\mathbf{K}, \mathbf{K}_Y) &= \frac{1}{m^2} \langle \mathbf{K}_c, \mathbf{K}_Y \rangle_F \leq \frac{1}{m^2} \|\mathbf{K}_c\|_F \|\mathbf{K}_Y\|_F \leq \frac{mR^2 \, mM}{m^2} = R^2 M,
\end{aligned}
\]

where we used the identity $\langle \mathbf{K}_c, \mathbf{K}_{Y_c} \rangle_F = \langle \mathbf{K}_c, \mathbf{K}_Y \rangle_F$ from Lemma 1. ■

We shall consider more generally the corresponding optimization with norm-$q$ constraint on $\boldsymbol{\mu}$,
$q > 1$:

\[
\begin{aligned}
\max_{\boldsymbol{\mu}} \quad & \widehat{\rho}_u\Big(\sum_{k=1}^p \mu_k \mathbf{K}_k, \mathbf{K}_Y\Big) = \Big\langle \sum_{k=1}^p \mu_k \mathbf{K}_{k_c}, \mathbf{K}_Y \Big\rangle_F \\
\text{subject to:} \quad & \sum_{k=1}^p \mu_k^q \leq \Lambda.
\end{aligned}
\tag{4}
\]

An explicit constraint enforcing $\boldsymbol{\mu} \geq 0$ is not necessary since, as we shall see, the optimal solution
found satisfies this constraint without imposing that.

Proposition 7 Let $\boldsymbol{\mu}^*$ be the solution of the optimization problem (4), then $\mu_k^* \propto \langle \mathbf{K}_{k_c}, \mathbf{K}_Y \rangle_F^{\frac{1}{q-1}}$.

Proof The Lagrangian corresponding to the optimization (4) is defined as follows,

\[
L(\boldsymbol{\mu}, \beta) = -\sum_{k=1}^p \mu_k \langle \mathbf{K}_{k_c}, \mathbf{K}_Y \rangle_F + \beta \Big( \sum_{k=1}^p \mu_k^q - \Lambda \Big),
\]

where the dual variable $\beta$ is non-negative. Differentiating with respect to $\mu_k$ and setting the result
to zero gives

\[
-\langle \mathbf{K}_{k_c}, \mathbf{K}_Y \rangle_F + q \beta \mu_k^{q-1} = 0 \implies \mu_k \propto \langle \mathbf{K}_{k_c}, \mathbf{K}_Y \rangle_F^{\frac{1}{q-1}},
\]

which concludes the proof. ■

Thus, for $q = 2$, $\mu_k \propto \langle \mathbf{K}_{k_c}, \mathbf{K}_Y \rangle_F$, which is exactly the solution given by Equation (3) modulo
normalization by the Frobenius norm of the base matrix. Note that for $q = 1$, the optimization
becomes trivial and can be solved by simply placing all the weight on the $\mu_k$ with the largest coefficient,
that is the $\mu_k$ whose corresponding kernel matrix $\mathbf{K}_k$ has the largest alignment with the target kernel.
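
The closed form of Proposition 7 can be sketched numerically as follows. The helper names, the random base kernels, and the values $q = 3$ and $\Lambda = 2$ are assumptions chosen for illustration; the rescaling simply makes the norm-$q$ constraint tight.

```python
import numpy as np


def center(K):
    """Center a kernel matrix: (I - 11^T/m) K (I - 11^T/m)."""
    m = K.shape[0]
    U = np.eye(m) - np.ones((m, m)) / m
    return U @ K @ U


def alignment_coefficients(base_kernels, y):
    """Coefficients c_k = <K_kc, K_Y>_F appearing in the objective of (4)."""
    K_Y = np.outer(y, y)
    c = np.array([np.sum(center(K) * K_Y) for K in base_kernels])
    return np.maximum(c, 0.0)  # guard against tiny negative round-off


def norm_q_weights(c, q, lam):
    """mu_k proportional to c_k^(1/(q-1)), rescaled so that sum_k mu_k^q = lam."""
    if q <= 1:
        raise ValueError("the closed form requires q > 1")
    mu = c ** (1.0 / (q - 1.0))
    return mu * (lam / np.sum(mu ** q)) ** (1.0 / q)


# Hypothetical data: three linear base kernels from random feature blocks.
rng = np.random.default_rng(3)
blocks = [rng.normal(size=(25, 2)) for _ in range(3)]
y = np.sign(blocks[0][:, 0])
base_kernels = [F @ F.T for F in blocks]

q, lam = 3.0, 2.0
c = alignment_coefficients(base_kernels, y)
mu = norm_q_weights(c, q, lam)
```

By Hölder's inequality, no other vector satisfying the constraint achieves a larger value of the linear objective `c @ mu`, which can be checked against randomly drawn feasible points.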

3.2 Alignment maximization algorithm 

The independent alignment-based method ignores the correlation between the base kernel matrices.
The alignment maximization method takes these correlations into account. It determines the mixture
weights $\mu_k$ jointly by seeking to maximize the alignment between the convex combination kernel
$\mathbf{K}_\mu = \sum_{k=1}^p \mu_k \mathbf{K}_k$ and the target kernel $\mathbf{K}_Y = \mathbf{y}\mathbf{y}^\top$.

This was also suggested in the case of uncentered alignment by Cristianini et al. (2001) and Kandola
et al. (2002a) and later studied by Lanckriet et al. (2004) who showed that the problem can be solved
as a QCQP. In what follows, we present even more efficient algorithms for computing the weights
$\mu_k$ by showing that the problem can be reduced to a simple QP. We start by examining the case of a
non-convex linear combination, where components of $\boldsymbol{\mu}$ can be negative, and show that the problem
admits a closed-form solution in that case. We then partially use that solution to obtain the solution
of the convex combination.

3.2.1 Linear combination 

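We can assume without loss of generality that the centered base kernel matrices $\mathbf{K}_{k_c}$ are independent
since otherwise we can select an independent subset. This condition ensures that $\|\mathbf{K}_{\mu_c}\|_F > 0$ for
arbitrary $\boldsymbol{\mu}$ and that $\widehat{\rho}(\mathbf{K}_\mu, \mathbf{y}\mathbf{y}^\top)$ is well defined (Definition 4). By Lemma 1, $\langle \mathbf{K}_{\mu_c}, \mathbf{K}_{Y_c} \rangle_F =
\langle \mathbf{K}_{\mu_c}, \mathbf{K}_Y \rangle_F$. Thus, since $\|\mathbf{K}_{Y_c}\|_F$ does not depend on $\boldsymbol{\mu}$, the alignment maximization problem
$\max_{\boldsymbol{\mu} \in \mathcal{M}} \widehat{\rho}(\mathbf{K}_\mu, \mathbf{y}\mathbf{y}^\top)$ can be equivalently written as the following optimization problem:

\[
\max_{\boldsymbol{\mu} \in \mathcal{M}} \frac{\langle \mathbf{K}_{\mu_c}, \mathbf{y}\mathbf{y}^\top \rangle_F}{\|\mathbf{K}_{\mu_c}\|_F}, \tag{5}
\]

where $\mathcal{M} = \{\boldsymbol{\mu} \colon \|\boldsymbol{\mu}\|_2 = 1\}$. A similar set can be defined via norm-1 instead of norm-2. As we
shall see, however, the problem can be solved in the same way in both cases. Note that, by Lemma 1,
$\mathbf{K}_{\mu_c} = \mathbf{U}_m \mathbf{K}_\mu \mathbf{U}_m$ with $\mathbf{U}_m = \mathbf{I} - \mathbf{1}\mathbf{1}^\top / m$, thus,

\[
\mathbf{K}_{\mu_c} = \mathbf{U}_m \Big( \sum_{k=1}^p \mu_k \mathbf{K}_k \Big) \mathbf{U}_m = \sum_{k=1}^p \mu_k \mathbf{U}_m \mathbf{K}_k \mathbf{U}_m = \sum_{k=1}^p \mu_k \mathbf{K}_{k_c}.
\]

Let

\[
\mathbf{a} = \big( \langle \mathbf{K}_{1_c}, \mathbf{y}\mathbf{y}^\top \rangle_F, \ldots, \langle \mathbf{K}_{p_c}, \mathbf{y}\mathbf{y}^\top \rangle_F \big)^\top,
\]

and $\mathbf{M}$ denote the matrix defined by

\[
\mathbf{M}_{kl} = \langle \mathbf{K}_{k_c}, \mathbf{K}_{l_c} \rangle_F,
\]

for $k, l \in [1, p]$. Note that since the base kernels are assumed independent, the matrix $\mathbf{M}$ is invertible.
Also, in view of the non-negativity of the Frobenius product of symmetric PSD matrices shown
in Section 2.3, the entries of $\mathbf{a}$ and $\mathbf{M}$ are all non-negative. Observe also that $\mathbf{M}$ is a symmetric
PSD matrix since for any vector $\mathbf{X} = (x_1, \ldots, x_p)^\top \in \mathbb{R}^p$,

\[
\begin{aligned}
\mathbf{X}^\top \mathbf{M} \mathbf{X} &= \sum_{k,l=1}^p x_k x_l \mathbf{M}_{kl} \\
&= \operatorname{Tr}\Big[ \sum_{k,l=1}^p x_k x_l \mathbf{K}_{k_c} \mathbf{K}_{l_c} \Big] \\
&= \operatorname{Tr}\Big[ \Big( \sum_{k=1}^p x_k \mathbf{K}_{k_c} \Big) \Big( \sum_{l=1}^p x_l \mathbf{K}_{l_c} \Big) \Big] = \Big\| \sum_{k=1}^p x_k \mathbf{K}_{k_c} \Big\|_F^2 \geq 0.
\end{aligned}
\]

Proposition 8 The solution $\boldsymbol{\mu}^*$ of the optimization problem (5) is given by $\boldsymbol{\mu}^* = \frac{\mathbf{M}^{-1} \mathbf{a}}{\|\mathbf{M}^{-1} \mathbf{a}\|}$.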
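Proof With the notation introduced, problem (5) can be rewritten as
$\boldsymbol{\mu}^* = \operatorname{argmax}_{\|\boldsymbol{\mu}\|_2 = 1} \frac{\boldsymbol{\mu}^\top \mathbf{a}}{\sqrt{\boldsymbol{\mu}^\top \mathbf{M} \boldsymbol{\mu}}}$.
Thus, clearly, the solution must verify $\boldsymbol{\mu}^{*\top} \mathbf{a} \geq 0$. We will square the objective and yet not enforce
this condition since, as we shall see, it will be verified by the solution we find. Therefore, we
consider the problem

\[
\boldsymbol{\mu}^* = \operatorname*{argmax}_{\|\boldsymbol{\mu}\|_2 = 1} \frac{(\boldsymbol{\mu}^\top \mathbf{a})^2}{\boldsymbol{\mu}^\top \mathbf{M} \boldsymbol{\mu}} = \operatorname*{argmax}_{\|\boldsymbol{\mu}\|_2 = 1} \frac{\boldsymbol{\mu}^\top \mathbf{a} \mathbf{a}^\top \boldsymbol{\mu}}{\boldsymbol{\mu}^\top \mathbf{M} \boldsymbol{\mu}}.
\]

In the final equality, we recognize the general Rayleigh quotient. Let $\boldsymbol{\nu} = \mathbf{M}^{1/2} \boldsymbol{\mu}$ and
$\boldsymbol{\nu}^* = \mathbf{M}^{1/2} \boldsymbol{\mu}^*$, then

\[
\boldsymbol{\nu}^* = \operatorname*{argmax}_{\|\mathbf{M}^{-1/2} \boldsymbol{\nu}\|_2 = 1} \frac{\boldsymbol{\nu}^\top \big[ \mathbf{M}^{-1/2} \mathbf{a} \mathbf{a}^\top \mathbf{M}^{-1/2} \big] \boldsymbol{\nu}}{\boldsymbol{\nu}^\top \boldsymbol{\nu}}.
\]

Hence, the solution is

\[
\boldsymbol{\nu}^* = \operatorname*{argmax}_{\|\mathbf{M}^{-1/2} \boldsymbol{\nu}\|_2 = 1} \frac{\big[ \boldsymbol{\nu}^\top \mathbf{M}^{-1/2} \mathbf{a} \big]^2}{\|\boldsymbol{\nu}\|_2^2} = \operatorname*{argmax}_{\|\mathbf{M}^{-1/2} \boldsymbol{\nu}\|_2 = 1} \Big[ \Big[ \frac{\boldsymbol{\nu}}{\|\boldsymbol{\nu}\|} \Big]^\top \mathbf{M}^{-1/2} \mathbf{a} \Big]^2.
\]

Thus, $\boldsymbol{\nu}^* \in \operatorname{Vec}(\mathbf{M}^{-1/2} \mathbf{a})$ with $\|\mathbf{M}^{-1/2} \boldsymbol{\nu}^*\|_2 = 1$. This yields immediately
$\boldsymbol{\mu}^* = \frac{\mathbf{M}^{-1} \mathbf{a}}{\|\mathbf{M}^{-1} \mathbf{a}\|}$, which
verifies $\boldsymbol{\mu}^{*\top} \mathbf{a} = \mathbf{a}^\top \mathbf{M}^{-1} \mathbf{a} / \|\mathbf{M}^{-1} \mathbf{a}\| \geq 0$ since $\mathbf{M}$ and $\mathbf{M}^{-1}$ are PSD. ■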
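
The closed-form solution of Proposition 8 can be sketched with NumPy as follows. The helper names and the random base kernels are assumptions made for illustration; the linear system is solved directly instead of forming $\mathbf{M}^{-1}$.

```python
import numpy as np


def center(K):
    """Center a kernel matrix: (I - 11^T/m) K (I - 11^T/m)."""
    m = K.shape[0]
    U = np.eye(m) - np.ones((m, m)) / m
    return U @ K @ U


def alignment_terms(base_kernels, y):
    """Return the vector a and the matrix M defined in Section 3.2.1."""
    centered = [center(K) for K in base_kernels]
    K_Y = np.outer(y, y)
    a = np.array([np.sum(Kc * K_Y) for Kc in centered])
    M = np.array([[np.sum(Ki * Kj) for Kj in centered] for Ki in centered])
    return a, M


def linear_alignment_weights(a, M):
    """mu* = M^{-1} a / ||M^{-1} a||; entries may be negative."""
    w = np.linalg.solve(M, a)
    return w / np.linalg.norm(w)


def partial_alignment(mu, a, M):
    """Objective of problem (5): mu^T a / sqrt(mu^T M mu)."""
    return float(mu @ a / np.sqrt(mu @ M @ mu))


# Hypothetical data: three linear base kernels from random feature blocks.
rng = np.random.default_rng(4)
blocks = [rng.normal(size=(30, 3)) for _ in range(3)]
y = np.sign(blocks[0][:, 0])
a, M = alignment_terms([F @ F.T for F in blocks], y)
mu_star = linear_alignment_weights(a, M)
best = partial_alignment(mu_star, a, M)
```

The value `best` coincides with $\sqrt{\mathbf{a}^\top \mathbf{M}^{-1} \mathbf{a}}$, and no other unit-norm vector yields a larger objective.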



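3.2.2 Convex combination (alignf)

In view of the proof of Proposition 8, the alignment maximization problem with the set
$\mathcal{M}' = \{\|\boldsymbol{\mu}\|_2 = 1 \wedge \boldsymbol{\mu} \geq 0\}$ can be written as

\[
\boldsymbol{\mu}^* = \operatorname*{argmax}_{\boldsymbol{\mu} \in \mathcal{M}'} \frac{\boldsymbol{\mu}^\top \mathbf{a} \mathbf{a}^\top \boldsymbol{\mu}}{\boldsymbol{\mu}^\top \mathbf{M} \boldsymbol{\mu}}. \tag{6}
\]

The following proposition shows that the problem can be reduced to solving a simple QP.

Proposition 9 Let $\mathbf{v}^*$ be the solution of the following QP:

\[
\min_{\mathbf{v} \geq 0} \mathbf{v}^\top \mathbf{M} \mathbf{v} - 2 \mathbf{v}^\top \mathbf{a}. \tag{7}
\]

Then, the solution $\boldsymbol{\mu}^*$ of the alignment maximization problem (6) is given by $\boldsymbol{\mu}^* = \mathbf{v}^* / \|\mathbf{v}^*\|$.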



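Proof Note that the objective function of problem (6) is invariant to scaling. The constraint $\|\boldsymbol{\mu}\| = 1$
only serves to enforce $0 < \|\boldsymbol{\mu}\| < +\infty$. Thus, using the same change of variable as in the proof of
Proposition 8, we can instead solve the following problem from which we can retrieve the solution
via normalization:

\[
\boldsymbol{\nu}^* = \operatorname*{argmax}_{\substack{0 < \|\mathbf{M}^{-1/2} \boldsymbol{\nu}\|_2 < +\infty \\ \mathbf{M}^{-1/2} \boldsymbol{\nu} \geq 0}} \Big[ \Big[ \frac{\boldsymbol{\nu}}{\|\boldsymbol{\nu}\|} \Big]^\top \mathbf{M}^{-1/2} \mathbf{a} \Big]^2.
\]

Equivalently, we can solve the following problem for any finite $\lambda > 0$:

\[
\max_{\substack{\mathbf{M}^{-1/2} \mathbf{u} \geq 0 \\ \|\mathbf{u}\| = \lambda}} \big[ \mathbf{u}^\top \mathbf{M}^{-1/2} \mathbf{a} \big]^2.
\]

Observe that for $\mathbf{M}^{-1/2} \mathbf{u} \geq 0$ the inner product is non-negative: $\mathbf{u}^\top \mathbf{M}^{-1/2} \mathbf{a} = (\mathbf{M}^{-1/2} \mathbf{u})^\top \mathbf{a} \geq 0$,
since the entries of $\mathbf{a}$ are non-negative. Furthermore, it can be written as follows:

\[
\begin{aligned}
\mathbf{u}^\top \mathbf{M}^{-1/2} \mathbf{a} &= -\frac{1}{2} \|\mathbf{u} - \mathbf{M}^{-1/2} \mathbf{a}\|^2 + \frac{1}{2} \|\mathbf{u}\|^2 + \frac{1}{2} \|\mathbf{M}^{-1/2} \mathbf{a}\|^2 \\
&= -\frac{1}{2} \|\mathbf{u} - \mathbf{M}^{-1/2} \mathbf{a}\|^2 + \frac{\lambda^2}{2} + \frac{1}{2} \|\mathbf{M}^{-1/2} \mathbf{a}\|^2.
\end{aligned}
\]

Thus, the problem becomes equivalent to the minimization:

\[
\min_{\substack{\mathbf{M}^{-1/2} \mathbf{u} \geq 0 \\ \|\mathbf{u}\| = \lambda}} \|\mathbf{u} - \mathbf{M}^{-1/2} \mathbf{a}\|^2. \tag{8}
\]

Now, we can omit the condition on the norm of $\mathbf{u}$ since (8) holds for arbitrary finite $\lambda > 0$ and since
neither $\mathbf{u} = 0$ nor any infinite norm $\mathbf{u}$ can be the solution even without this condition. Thus, we can
now consider instead:

\[
\min_{\mathbf{M}^{-1/2} \mathbf{u} \geq 0} \|\mathbf{u} - \mathbf{M}^{-1/2} \mathbf{a}\|^2.
\]

The change of variable $\mathbf{u} = \mathbf{M}^{1/2} \mathbf{v}$ leads to: $\min_{\mathbf{v} \geq 0} \|\mathbf{M}^{1/2} \mathbf{v} - \mathbf{M}^{-1/2} \mathbf{a}\|^2$. This is a standard
least-square regression problem with non-negativity constraints, a simple and widely studied QP
for which several families of algorithms have been designed. Expanding the terms, we obtain the
equivalent problem: $\min_{\mathbf{v} \geq 0} \mathbf{v}^\top \mathbf{M} \mathbf{v} - 2 \mathbf{v}^\top \mathbf{a}$. ■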

Note that solving this QP problem does not require a matrix inversion of M. Also, it is not hard to 
see that this problem is equivalent to solving a hard margin SVM problem, thus, any SVM solver 
can also be used to solve it. A similar problem with the non-centered definition of alignment is 
treated by Kandola et al. (2002b), but their optimization solution differs from ours and requires 
cross-validation. 
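The QP of Proposition 9 can be solved with any non-negative least-squares or QP routine. The sketch below uses cyclic coordinate descent, which is a choice made here for brevity and not the solver used in the paper; the function name and the 2x2 instance are illustrative assumptions. It requires $\mathbf{M}$ to have a positive diagonal and $\mathbf{a}$ to have at least one positive entry, so that $\mathbf{v} \neq 0$.

```python
import numpy as np

def alignf_weights(M, a, n_sweeps=500, tol=1e-12):
    """Solve min_{v >= 0} v^T M v - 2 v^T a by cyclic coordinate descent.

    Returns (mu, v) with mu = v / ||v||. No inverse of M is ever formed.
    """
    p = len(a)
    v = np.zeros(p)
    for _ in range(n_sweeps):
        v_old = v.copy()
        for k in range(p):
            # Exact minimizer over v_k >= 0 with the other coordinates held fixed:
            # v_k = max(0, (a_k - sum_{j != k} M_kj v_j) / M_kk).
            residual = a[k] - M[k] @ v + M[k, k] * v[k]
            v[k] = max(0.0, residual / M[k, k])
        if np.max(np.abs(v - v_old)) < tol:
            break
    return v / np.linalg.norm(v), v

# Small illustrative instance (values chosen by hand, not taken from the paper).
M = np.array([[2.0, 1.0], [1.0, 2.0]])
a = np.array([3.0, 0.0])
mu, v = alignf_weights(M, a)
```

On this instance the unconstrained minimizer $\mathbf{M}^{-1}\mathbf{a} = (2, -1)^\top$ violates the constraint, and the solver returns the sparse solution with the second weight set to zero, which illustrates why alignf tends to produce sparse weight vectors.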

Also, the assumption about the invertibility of matrix $\mathbf{M}$ is not necessary and a maximal alignment solution can be computed using the same optimization as that of Proposition 9 in the non-invertible case. The optimization problem is then not strictly convex, however, and the alignment solution $\mu$ is not unique.

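We now further analyze the properties of the solution $\mathbf{v}$ of this optimization. Let $\rho_0(\mu)$ denote the partially normalized alignment maximized by (5):

\[
\rho_0(\mu) = \|\mathbf{y}\mathbf{y}^\top\|_F \, \rho(\mu)
            = \frac{\langle \mathbf{K}_{\mu_c}, \mathbf{y}\mathbf{y}^\top \rangle_F}{\|\mathbf{K}_{\mu_c}\|_F}
            = \frac{\mu^\top \mathbf{a}}{\sqrt{\mu^\top \mathbf{M} \mu}}. \tag{9}
\]

The following proposition gives a simple expression for $\rho_0(\mu)$.

Proposition 10 For $\mu = \mathbf{v}/\|\mathbf{v}\|$, with $\mathbf{v} \neq 0$ solution of the alignment maximization problem (7), the following identity holds:

\[
\rho_0(\mu) = \|\mathbf{v}\|_{\mathbf{M}}. \tag{10}
\]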
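Proof Since $\|\mathbf{v}\|_{\mathbf{M}}^2 - 2\mathbf{v}^\top\mathbf{a} = \|\mathbf{v}\|_{\mathbf{M}}^2 - 2\langle \mathbf{v}, \mathbf{M}^{-1}\mathbf{a} \rangle_{\mathbf{M}} = \|\mathbf{v} - \mathbf{M}^{-1}\mathbf{a}\|_{\mathbf{M}}^2 - \|\mathbf{M}^{-1}\mathbf{a}\|_{\mathbf{M}}^2$, the optimization problem (7) can be equivalently written as

\[
\min_{\mathbf{v} \geq 0} \|\mathbf{v} - \mathbf{M}^{-1}\mathbf{a}\|_{\mathbf{M}}^2.
\]

This implies that the solution $\mathbf{v}$ is the $\mathbf{M}$-orthogonal projection of $\mathbf{M}^{-1}\mathbf{a}$ over the convex set $\{\mathbf{v} : \mathbf{v} \geq 0\}$. Therefore, $\mathbf{v} - \mathbf{M}^{-1}\mathbf{a}$ is $\mathbf{M}$-orthogonal to $\mathbf{v}$:

\[
\langle \mathbf{v}, \mathbf{v} - \mathbf{M}^{-1}\mathbf{a} \rangle_{\mathbf{M}} = 0
\;\Longrightarrow\;
\|\mathbf{v}\|_{\mathbf{M}}^2 = \langle \mathbf{v}, \mathbf{M}^{-1}\mathbf{a} \rangle_{\mathbf{M}}.
\]

Thus,

\[
\|\mathbf{v}\|_{\mathbf{M}} = \frac{\langle \mathbf{v}, \mathbf{M}^{-1}\mathbf{a} \rangle_{\mathbf{M}}}{\|\mathbf{v}\|_{\mathbf{M}}}
= \frac{\langle \mu, \mathbf{M}^{-1}\mathbf{a} \rangle_{\mathbf{M}}}{\|\mu\|_{\mathbf{M}}} = \rho_0(\mu),
\]

which concludes the proof. ■

Thus, the proposition gives a straightforward way of computing $\rho_0(\mu)$, and thereby also $\rho(\mu)$, from the $\mathbf{M}$-norm of the solution vector $\mathbf{v}$ that $\mu$ is derived from.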
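The identity of Proposition 10 can be checked by hand on a small instance. The matrix $\mathbf{M}$ and vector $\mathbf{a}$ below are chosen purely for illustration and do not come from the experiments.

```latex
% Illustrative 2x2 instance (hand-picked values).
\[
\mathbf{M} = \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix}, \qquad
\mathbf{a} = \begin{pmatrix} 3 \\ 0 \end{pmatrix}, \qquad
\mathbf{M}^{-1}\mathbf{a} = \begin{pmatrix} 2 \\ -1 \end{pmatrix}.
\]
% The unconstrained minimizer has a negative entry, so v >= 0 is active on the second coordinate.
\[
\mathbf{v} = \begin{pmatrix} 3/2 \\ 0 \end{pmatrix}, \qquad
\mathbf{M}\mathbf{v} - \mathbf{a} = \begin{pmatrix} 0 \\ 3/2 \end{pmatrix} \geq 0
\quad \text{(the KKT conditions of (7) hold)}.
\]
% Both sides of identity (10), with mu = v / ||v|| = (1, 0)^T:
\[
\|\mathbf{v}\|_{\mathbf{M}} = \sqrt{\mathbf{v}^\top \mathbf{M} \mathbf{v}} = \sqrt{2 \cdot \tfrac{9}{4}} = \frac{3}{\sqrt{2}},
\qquad
\rho_0(\mu) = \frac{\mu^\top \mathbf{a}}{\sqrt{\mu^\top \mathbf{M} \mu}} = \frac{3}{\sqrt{2}}.
\]
```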

3.3 Relationship of two-stage algorithms with ensemble techniques 

An alternative two-stage technique consists of first learning a prediction hypothesis $h_k$ using each kernel $K_k$, $k \in [1, p]$, and then of learning the best linear combination of these hypotheses: $h = \sum_{k=1}^p \mu_k h_k$. But such ensemble-based techniques make use of a richer hypothesis space than the one used by learning kernel algorithms such as that of Lanckriet et al. (2004). For ensemble techniques, each hypothesis $h_k$, $k \in [1, p]$, is of the form $h_k = \sum_{i=1}^m \alpha_{ik} K_k(x_i, \cdot)$ for some $\alpha_k = (\alpha_{1k}, \ldots, \alpha_{mk})^\top \in \mathbb{R}^m$ with different constraints $\|\alpha_k\| \leq \Lambda_k$, $\Lambda_k \geq 0$, and the final hypothesis is of the form

\[
\sum_{k=1}^p \mu_k h_k = \sum_{k=1}^p \mu_k \sum_{i=1}^m \alpha_{ik} K_k(x_i, \cdot)
= \sum_{i=1}^m \sum_{k=1}^p \mu_k \alpha_{ik} K_k(x_i, \cdot).
\]

In contrast, the general form of the hypothesis learned using kernel learning algorithms is

\[
\sum_{i=1}^m \alpha_i K_\mu(x_i, \cdot) = \sum_{i=1}^m \alpha_i \sum_{k=1}^p \mu_k K_k(x_i, \cdot)
= \sum_{i=1}^m \sum_{k=1}^p \mu_k \alpha_i K_k(x_i, \cdot),
\]

for some $\alpha \in \mathbb{R}^m$ with $\|\alpha\| \leq \Lambda$, $\Lambda \geq 0$. When the coefficients $\alpha_{ik}$ can be decoupled, that is $\alpha_{ik} = \alpha_i \beta_k$ for some $\beta_k$s, the two solutions seem to have the same form but they are in fact different since in general the coefficients must obey different constraints (different $\Lambda_k$s). Furthermore, the combination weights $\mu_k$ are not required to be positive in the ensemble case. We are presenting elsewhere a more detailed theoretical and empirical comparison of the ensemble and learning kernel techniques.

3.4 Single-stage alignment-based algorithm 

This section analyzes an optimization based on the notion of centered alignment, which can be 
viewed as the single-stage counterpart of the two-stage algorithm discussed in previous sections. 

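As in the previous section, we denote by $\mathbf{a}$ the vector $(\langle \mathbf{K}_{1c}, \mathbf{y}\mathbf{y}^\top \rangle_F, \ldots, \langle \mathbf{K}_{pc}, \mathbf{y}\mathbf{y}^\top \rangle_F)^\top$ and let $\mathbf{M} \in \mathbb{R}^{p \times p}$ be the matrix defined by $\mathbf{M}_{kl} = \langle \mathbf{K}_{kc}, \mathbf{K}_{lc} \rangle_F$. The optimization is then defined by augmenting standard single-stage learning kernel optimizations with an alignment maximization constraint. Thus, the domain $\mathcal{M}$ of the kernel combination vector $\mu$ is defined by:

\[
\mathcal{M} = \{\mu : \mu \geq 0 \wedge \|\mu\| \leq \Lambda \wedge \rho(\mathbf{K}_\mu, \mathbf{y}\mathbf{y}^\top) \geq \Omega\},
\]

for non-negative parameters $\Lambda$ and $\Omega$. The alignment constraint $\rho(\mathbf{K}_\mu, \mathbf{y}\mathbf{y}^\top) \geq \Omega$ can be rewritten as $\Omega\sqrt{\mu^\top \mathbf{M} \mu} - \mu^\top \mathbf{a} \leq 0$, which defines a convex region. Thus, $\mathcal{M}$ is a convex subset of $\mathbb{R}^p$.

For a fixed $\mu \in \mathcal{M}$ and corresponding kernel matrix $\mathbf{K}_\mu$, let $F(\mu, \alpha)$ denote the objective function of the dual optimization problem $\max_{\alpha \in \mathcal{A}} F(\mu, \alpha)$ solved by an algorithm such as SVM, KRR, or more generally any other algorithm for which $\mathcal{A}$ is a convex set, $F(\mu, \cdot)$ is a concave function for all $\mu \in \mathcal{M}$, and $F(\cdot, \alpha)$ is convex for all $\alpha \in \mathcal{A}$. Then, the general form of a single-stage alignment-based learning kernel optimization is

\[
\min_{\mu \in \mathcal{M}} \max_{\alpha \in \mathcal{A}} F(\mu, \alpha).
\]

Note that, by the convex-concave properties of $F$ and the convexity of $\mathcal{M}$ and $\mathcal{A}$, von Neumann's minimax theorem applies:

\[
\min_{\mu \in \mathcal{M}} \max_{\alpha \in \mathcal{A}} F(\mu, \alpha) = \max_{\alpha \in \mathcal{A}} \min_{\mu \in \mathcal{M}} F(\mu, \alpha).
\]

We now further examine this optimization problem in the specific case of the kernel ridge regression algorithm. In the case of KRR, $F(\mu, \alpha) = -\alpha^\top(\mathbf{K}_\mu + \lambda \mathbf{I})\alpha + 2\alpha^\top \mathbf{y}$. Thus, the max-min problem can be rewritten as

\[
\max_{\alpha \in \mathcal{A}} \min_{\mu \in \mathcal{M}} \; -\alpha^\top(\mathbf{K}_\mu + \lambda \mathbf{I})\alpha + 2\alpha^\top \mathbf{y}.
\]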

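Let $\mathbf{b}_\alpha$ denote the vector with the $k$th component $\alpha^\top \mathbf{K}_{kc}\, \alpha$, then the problem can be rewritten as

\[
\max_{\alpha \in \mathcal{A}} \; -\lambda \alpha^\top \alpha + 2\alpha^\top \mathbf{y} - \max_{\mu \in \mathcal{M}} \mu^\top \mathbf{b}_\alpha,
\]

where $\lambda = \lambda_0 m$ in the notation of equation (14). By the convexity of the last constraint in $\mathcal{M}$, the last optimization problem can be rewritten using the Lagrange technique as

\[
\begin{aligned}
\min_{\mu} \quad & -\mu^\top \mathbf{b}_\alpha + \gamma\big(\Omega\sqrt{\mu^\top \mathbf{M} \mu} - \mu^\top \mathbf{a}\big) \\
\text{subject to} \quad & \mu \geq 0 \wedge \|\mu\| \leq \Lambda \wedge \gamma \geq 0.
\end{aligned}
\]

Viewing $\Omega$ as a Lagrange variable leads equivalently to

\[
\begin{aligned}
\min_{\mu} \quad & -\mu^\top(\gamma \mathbf{a} + \mathbf{b}_\alpha) \\
\text{subject to} \quad & \mu \geq 0 \wedge \|\mu\| \leq \Lambda \wedge \gamma \geq 0 \wedge \mu^\top \mathbf{M} \mu \leq \Omega'^2.
\end{aligned}
\]

Reusing the Lagrange technique leads to

\[
\begin{aligned}
\min_{\mu} \quad & -\mu^\top(\gamma \mathbf{a} + \mathbf{b}_\alpha) + \gamma' \mu^\top \mu + \gamma'' \mu^\top \mathbf{M} \mu \\
\text{subject to} \quad & \mu \geq 0 \wedge \gamma, \gamma', \gamma'' \geq 0.
\end{aligned}
\]

This is a simple QP problem. Note that the overall problem can now be written as

\[
\max_{\alpha \in \mathcal{A},\, \mu \geq 0} \; -\lambda \alpha^\top \alpha + 2\alpha^\top \mathbf{y} + \mu^\top(\gamma \mathbf{a} + \mathbf{b}_\alpha) - \gamma' \mu^\top \mu - \gamma'' \mu^\top \mathbf{M} \mu.
\]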
a£A,fi>0 
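This last problem is not convex in $(\alpha, \mu)$, but the problem is convex in each variable. In the case of kernel ridge regression, the maximization in $\alpha$ admits a closed-form solution. Plugging in that solution yields the following convex optimization problem in $\mu$:

\[
\min_{\mu \geq 0} \; \mathbf{y}^\top(\mathbf{K}_\mu + \lambda \mathbf{I})^{-1}\mathbf{y} - \gamma \mu^\top \mathbf{a} + \mu^\top(\gamma'' \mathbf{M} + \gamma' \mathbf{I})\mu.
\]

Note that multiplying the objective by $\lambda$ using the substitution $\mu' = \frac{1}{\lambda}\mu$ results in the following equivalent problem,

\[
\min_{\mu' \geq 0} \; \mathbf{y}^\top(\mathbf{K}_{\mu'} + \mathbf{I})^{-1}\mathbf{y} - \lambda^2 \gamma \mu'^\top \mathbf{a} + \mu'^\top(\lambda^3 \gamma'' \mathbf{M} + \lambda^3 \gamma' \mathbf{I})\mu',
\]

which makes clear that the trade-off parameter $\lambda$ can be subsumed by the $\gamma$, $\gamma'$ and $\gamma''$ parameters. This leads to the following simpler problem with a reduced number of trade-off parameters,

\[
\min_{\mu \geq 0} \; \mathbf{y}^\top(\mathbf{K}_\mu + \mathbf{I})^{-1}\mathbf{y} - \gamma \mu^\top \mathbf{a} + \mu^\top(\gamma'' \mathbf{M} + \gamma' \mathbf{I})\mu. \tag{11}
\]

This is a convex optimization problem. In particular, $\mu \mapsto \mathbf{y}^\top(\mathbf{K}_\mu + \mathbf{I})^{-1}\mathbf{y}$ is a convex function by convexity of $f : \mathbf{M} \mapsto \mathbf{y}^\top \mathbf{M}^{-1}\mathbf{y}$ over the set of positive definite symmetric matrices. The convexity of $f$ can be seen from that of its epigraph, which, by the property of the Schur complement, can be written as follows (Boyd and Vandenberghe, 2004):

\[
\operatorname{epi} f = \{(\mathbf{M}, t) : \mathbf{M} \succ 0,\; \mathbf{y}^\top \mathbf{M}^{-1}\mathbf{y} \leq t\}
= \left\{(\mathbf{M}, t) : \begin{pmatrix} \mathbf{M} & \mathbf{y} \\ \mathbf{y}^\top & t \end{pmatrix} \succeq 0,\; \mathbf{M} \succ 0 \right\}.
\]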

This defines a linear matrix inequality in (M, t) and thus a convex set. The convex optimization 
(11) can be solved efficiently using a simple iterative algorithm as in (Cortes et al., 2009a). In 
practice, the algorithm converges within 10-50 iterations. We have run experiments comparing this 
single-stage centered alignment algorithm with the two-stage one presented in the previous sections. 
Section 5 reports the results of these and all of our other experiments. 
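To make the structure of problem (11) concrete, the following sketch evaluates the objective and its gradient and minimizes it with projected gradient descent and a simple backtracking rule. This is a stand-in chosen for illustration, not the iterative algorithm of Cortes et al. (2009a); the function names, the toy data, and the values of the trade-off parameters are all assumptions. The variables `gamma1` and `gamma2` stand for $\gamma'$ and $\gamma''$.

```python
import numpy as np

def center_kernel(K):
    m = K.shape[0]
    H = np.eye(m) - np.ones((m, m)) / m
    return H @ K @ H

def objective_and_grad(mu, kernels, y, a, M, gamma, gamma1, gamma2):
    """Value and gradient of y^T (K_mu + I)^{-1} y - gamma mu^T a + mu^T (gamma2 M + gamma1 I) mu."""
    K_mu = sum(w * K for w, K in zip(mu, kernels))
    alpha = np.linalg.solve(K_mu + np.eye(len(y)), y)  # (K_mu + I)^{-1} y
    Q = gamma2 * M + gamma1 * np.eye(len(mu))
    value = y @ alpha - gamma * (mu @ a) + mu @ Q @ mu
    # The derivative of y^T (K_mu + I)^{-1} y with respect to mu_k is -alpha^T K_k alpha.
    grad = np.array([-(alpha @ K @ alpha) for K in kernels]) - gamma * a + 2 * Q @ mu
    return value, grad

def projected_gradient(params, p, steps=100):
    mu = np.full(p, 1.0 / p)
    values = [objective_and_grad(mu, *params)[0]]
    for _ in range(steps):
        value, grad = objective_and_grad(mu, *params)
        step, accepted = 1.0, False
        while step > 1e-12 and not accepted:
            candidate = np.maximum(mu - step * grad, 0.0)  # projection onto mu >= 0
            new_value = objective_and_grad(candidate, *params)[0]
            if new_value <= value:
                mu, accepted = candidate, True
                values.append(new_value)
            else:
                step *= 0.5  # backtrack until the objective does not increase
        if not accepted:
            break
    return mu, values

# Hypothetical toy regression problem with three centered linear base kernels.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 6))
y = X[:, 0] - X[:, 3]
y = (y - y.mean()) / y.std()
kernels = [center_kernel(X[:, 2 * k:2 * k + 2] @ X[:, 2 * k:2 * k + 2].T) for k in range(3)]
a = np.array([y @ K @ y for K in kernels])
M = np.array([[np.sum(Ki * Kj) for Kj in kernels] for Ki in kernels])
params = (kernels, y, a, M, 1e-2, 1e-2, 1e-3)
mu, values = projected_gradient(params, p=3)
```

Since the problem is convex and the feasible set is the non-negative orthant, the projection reduces to clipping at zero, which keeps the sketch short.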

4. Theoretical results 

This section presents a series of theoretical guarantees related to the notion of kernel alignment. 
Section 4.1 proves a concentration bound of the form $|\rho - \hat{\rho}| \leq O(1/\sqrt{m})$, which relates the centered alignment $\rho$ to its empirical estimate $\hat{\rho}$. In Section 4.2, we prove the existence of accurate
predictors in both classification and regression in the presence of a kernel K with good alignment 
with respect to the target kernel. Section 4.3 presents stability-based generalization bounds for the 
two-stage alignment maximization algorithm whose first stage was described in Section 3.2.2. 

4.1 Concentration bounds for centered alignment 

Our concentration bound differs from that of Cristianini et al. (2001) both because our definition of alignment is different and because we give a bound directly on the quantity of interest $|\rho - \hat{\rho}|$. Instead, Cristianini et al. (2001) give a bound on $|A' - \hat{A}|$, where $A' \neq A$ can be defined from $\hat{A}$ by replacing each Frobenius product with its expectation over samples of size $m$.

The following proposition gives a bound on the essential quantities appearing in the definition 
of the alignments. The proof is moved to the appendix. 
Proposition 11 Let $\mathbf{K}$ and $\mathbf{K}'$ denote kernel matrices associated to the kernel functions $K$ and $K'$ for a sample of size $m$ drawn according to $D$. Assume that for any $x \in \mathcal{X}$, $K(x, x) \leq R^2$ and $K'(x, x) \leq R'^2$. Then, for any $\delta > 0$, with probability at least $1 - \delta$, the following inequality holds:

\[
\left| \frac{\langle \mathbf{K}_c, \mathbf{K}'_c \rangle_F}{m^2} - \mathrm{E}[K_c K'_c] \right|
\leq \frac{18 R^2 R'^2}{m} + 24 R^2 R'^2 \sqrt{\frac{\log\frac{2}{\delta}}{2m}}.
\]

Note that in the case $K'(x_i, x_j) = y_i y_j$, we then have $R'^2 \leq \max_i y_i^2$.

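Proof The proof relies on a series of lemmas given in the Appendix. By the triangle inequality and in view of Lemma 19, the following holds:

\[
\left| \frac{\langle \mathbf{K}_c, \mathbf{K}'_c \rangle_F}{m^2} - \mathrm{E}[K_c K'_c] \right|
\leq \left| \frac{\langle \mathbf{K}_c, \mathbf{K}'_c \rangle_F}{m^2} - \mathrm{E}\!\left[\frac{\langle \mathbf{K}_c, \mathbf{K}'_c \rangle_F}{m^2}\right] \right| + \frac{18 R^2 R'^2}{m}.
\]

Now, in view of Lemma 18, the application of McDiarmid's inequality (McDiarmid, 1989) to $\frac{\langle \mathbf{K}_c, \mathbf{K}'_c \rangle_F}{m^2}$ gives for any $\epsilon > 0$:

\[
\Pr\!\left[ \left| \frac{\langle \mathbf{K}_c, \mathbf{K}'_c \rangle_F}{m^2} - \mathrm{E}\!\left[\frac{\langle \mathbf{K}_c, \mathbf{K}'_c \rangle_F}{m^2}\right] \right| > \epsilon \right]
\leq 2 \exp\!\left[ -2m\epsilon^2 / (24 R^2 R'^2)^2 \right].
\]

Setting $\delta$ to be equal to the right-hand side yields the statement of the proposition. ■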
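To get a feel for the rate in Proposition 11, the right-hand side can be evaluated numerically. The helper below is a small sketch; its name and the sample values of $R^2$, $R'^2$ and $\delta$ are assumptions made for illustration.

```python
import math

def prop11_deviation(m, R2, R2_prime, delta):
    """Right-hand side of Proposition 11:
    18 R^2 R'^2 / m + 24 R^2 R'^2 sqrt(log(2 / delta) / (2 m))."""
    scale = R2 * R2_prime
    return 18.0 * scale / m + 24.0 * scale * math.sqrt(math.log(2.0 / delta) / (2.0 * m))

# Illustrative values: kernels bounded by 1 on the diagonal, confidence 95%.
for m in (10**3, 10**4, 10**5, 10**6):
    print(m, prop11_deviation(m, 1.0, 1.0, 0.05))
```

For large $m$ the second term dominates, so the bound decays like $1/\sqrt{m}$, in line with the $O(1/\sqrt{m})$ rate stated at the start of this section.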



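Theorem 12 Under the assumptions of Proposition 11, and further assuming that the conditions of the Definitions 2-4 are satisfied for $\rho(K, K')$ and $\hat{\rho}(\mathbf{K}, \mathbf{K}')$, for any $\delta > 0$, with probability at least $1 - \delta$, the following inequality holds:

\[
|\rho(K, K') - \hat{\rho}(\mathbf{K}, \mathbf{K}')| \leq 18\beta \left[ \frac{3}{m} + 8\sqrt{\frac{\log\frac{6}{\delta}}{2m}} \right],
\]

with $\beta = \max\big(R^2 R'^2 / \mathrm{E}[K_c^2],\; R^2 R'^2 / \mathrm{E}[K'^2_c]\big)$.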

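Proof To shorten the presentation, we first simplify the notation for the alignments as follows:

\[
\rho(K, K') = \frac{b}{\sqrt{a a'}}, \qquad \hat{\rho}(\mathbf{K}, \mathbf{K}') = \frac{\hat{b}}{\sqrt{\hat{a} \hat{a}'}},
\]

with $b = \mathrm{E}[K_c K'_c]$, $a = \mathrm{E}[K_c^2]$, $a' = \mathrm{E}[K'^2_c]$ and similarly, $\hat{b} = (1/m^2)\langle \mathbf{K}_c, \mathbf{K}'_c \rangle_F$, $\hat{a} = (1/m^2)\|\mathbf{K}_c\|^2$, and $\hat{a}' = (1/m^2)\|\mathbf{K}'_c\|^2$. By Proposition 11 and the union bound, for any $\delta > 0$, with probability at least $1 - \delta$, all three differences $a - \hat{a}$, $a' - \hat{a}'$, and $b - \hat{b}$ are bounded by $\alpha = \frac{18 R^2 R'^2}{m} + 24 R^2 R'^2 \sqrt{\frac{\log\frac{6}{\delta}}{2m}}$. Using the definitions of $\rho$ and $\hat{\rho}$, we can write:

\[
\begin{aligned}
|\rho(K, K') - \hat{\rho}(\mathbf{K}, \mathbf{K}')|
&= \left| \frac{b}{\sqrt{a a'}} - \frac{\hat{b}}{\sqrt{\hat{a} \hat{a}'}} \right|
 = \left| \frac{\hat{b}\sqrt{a a'} - b\sqrt{\hat{a} \hat{a}'}}{\sqrt{a a' \hat{a} \hat{a}'}} \right| \\
&= \left| \frac{(\hat{b} - b)\sqrt{a a'} - b\big(\sqrt{\hat{a} \hat{a}'} - \sqrt{a a'}\big)}{\sqrt{a a' \hat{a} \hat{a}'}} \right| \\
&= \left| \frac{\hat{b} - b}{\sqrt{\hat{a} \hat{a}'}} - \rho(K, K') \frac{\hat{a} \hat{a}' - a a'}{\sqrt{\hat{a} \hat{a}'}\big(\sqrt{a a'} + \sqrt{\hat{a} \hat{a}'}\big)} \right|.
\end{aligned}
\]

Since $\rho(K, K') \in [0, 1]$, it follows that

\[
|\rho(K, K') - \hat{\rho}(\mathbf{K}, \mathbf{K}')| \leq \frac{|b - \hat{b}|}{\sqrt{\hat{a} \hat{a}'}} + \frac{|a a' - \hat{a} \hat{a}'|}{\sqrt{\hat{a} \hat{a}'}\big(\sqrt{a a'} + \sqrt{\hat{a} \hat{a}'}\big)}.
\]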



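Assume first that $\hat{a} \leq \hat{a}'$. Rewriting the right-hand side to make the differences $a - \hat{a}$ and $a' - \hat{a}'$ appear, we obtain:

\[
\begin{aligned}
|\rho(K, K') - \hat{\rho}(\mathbf{K}, \mathbf{K}')|
&\leq \frac{|b - \hat{b}|}{\sqrt{\hat{a} \hat{a}'}} + \frac{|(a - \hat{a})a' + \hat{a}(a' - \hat{a}')|}{\sqrt{\hat{a} \hat{a}'}\big(\sqrt{a a'} + \sqrt{\hat{a} \hat{a}'}\big)} \\
&\leq \frac{\alpha}{\sqrt{\hat{a} \hat{a}'}} \left[ 1 + \frac{a' + \hat{a}}{\sqrt{a a'} + \sqrt{\hat{a} \hat{a}'}} \right] \\
&\leq \frac{\alpha}{\sqrt{\hat{a} \hat{a}'}} \left[ 1 + \frac{a'}{\sqrt{a a'}} + \frac{\hat{a}}{\sqrt{\hat{a} \hat{a}'}} \right].
\end{aligned}
\]

We can similarly obtain the symmetric bound when $\hat{a}' \leq \hat{a}$. Both bounds are less than or equal to $3\max\big(\frac{\alpha}{a}, \frac{\alpha}{a'}\big)$. ■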



4.2 Existence of good alignment-based predictors 

For classification and regression tasks, the target kernel is based on the labels and defined by $K_Y(x, x') = yy'$, where we denote by $y$ the label of point $x$ and $y'$ that of $x'$. This section shows the existence of predictors with high accuracy both for classification and regression when the alignment $\rho(K, K_Y)$ between the kernel $K$ and $K_Y$ is high.

In the regression setting, we shall assume that the labels have been first normalized by dividing by the standard deviation (assumed finite), thus $\mathrm{E}[y^2] = 1$. In classification, $y = \pm 1$ and thus $\mathrm{E}[y^2] = 1$ without any normalization. Denote by $h^*$ the hypothesis defined for all $x \in \mathcal{X}$ by

\[
h^*(x) = \frac{\mathrm{E}_{x'}[y' K_c(x, x')]}{\sqrt{\mathrm{E}[K_c^2]}}.
\]

Observe that by definition of $h^*$, $\mathrm{E}_x[y h^*(x)] = \rho(K, K_Y)$. For any $x \in \mathcal{X}$, define $\gamma(x) = \sqrt{\frac{\mathrm{E}_{x'}[K_c^2(x, x')]}{\mathrm{E}_{x, x'}[K_c^2(x, x')]}}$ and $\Gamma = \max_x \gamma(x)$. The following result shows that the hypothesis $h^*$ has high accuracy when the kernel alignment is high and $\Gamma$ not too large.²

Theorem 13 (classification) Let $R(h^*) = \Pr[y h^*(x) < 0]$ denote the error of $h^*$ in binary classification. For any kernel $K$ such that $0 < \mathrm{E}[K_c^2] < +\infty$, the following holds:

\[
R(h^*) \leq 1 - \rho(K, K_Y)/\Gamma.
\]



2. A version of this result was presented by Cristianini, Shawe-Taylor, Elisseeff, and Kandola (2001) and Cristianini, Kandola, Elisseeff, and Shawe-Taylor (2002) for the so-called Parzen window solution and non-centered kernels. However, both proofs are incorrect since they seem to rely implicitly on the fact that $\max_x \Big[\frac{\mathrm{E}_{x'}[K^2(x, x')]}{\mathrm{E}_{x, x'}[K^2(x, x')]}\Big]^{1/2} = 1$, which can only hold in the trivial case where the kernel function $K$ is a constant: by definition of the maximum and expectation operators, $\max_x [\mathrm{E}_{x'}[K^2(x, x')]] \geq \mathrm{E}_x[\mathrm{E}_{x'}[K^2(x, x')]]$, with equality only in the constant case.
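Proof Note that for all $x \in \mathcal{X}$,

\[
|y h^*(x)| = \frac{|y\, \mathrm{E}_{x'}[y' K_c(x, x')]|}{\sqrt{\mathrm{E}[K_c^2]}}
\leq \frac{\sqrt{\mathrm{E}_{x'}[y'^2]\, \mathrm{E}_{x'}[K_c^2(x, x')]}}{\sqrt{\mathrm{E}[K_c^2]}}
= \frac{\sqrt{\mathrm{E}_{x'}[K_c^2(x, x')]}}{\sqrt{\mathrm{E}[K_c^2]}} \leq \Gamma.
\]

In view of this inequality, and the fact that $\mathrm{E}_x[y h^*(x)] = \rho(K, K_Y)$, we can write:

\[
\begin{aligned}
1 - R(h^*) &= \Pr[y h^*(x) \geq 0] \\
&= \mathrm{E}\big[1_{\{y h^*(x) \geq 0\}}\big] \\
&\geq \mathrm{E}\Big[\frac{y h^*(x)}{\Gamma}\, 1_{\{y h^*(x) \geq 0\}}\Big] \\
&\geq \mathrm{E}\Big[\frac{y h^*(x)}{\Gamma}\Big] = \rho(K, K_Y)/\Gamma,
\end{aligned}
\]

where $1_\omega$ is the indicator function of the event $\omega$. ■

A probabilistic version of the theorem can be straightforwardly derived by noting that by Markov's inequality, for any $\delta > 0$, with probability at least $1 - \delta$, $|\gamma(x)| \leq 1/\sqrt{\delta}$.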

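Theorem 14 (regression) Let $R(h^*) = \mathrm{E}_x[(y - h^*(x))^2]$ denote the error of $h^*$ in regression. For any kernel $K$ such that $0 < \mathrm{E}[K_c^2] < +\infty$, the following holds:

\[
R(h^*) \leq 2(1 - \rho(K, K_Y)).
\]

Proof By the Cauchy-Schwarz inequality, it follows that:

\[
\begin{aligned}
\mathrm{E}_x[h^{*2}(x)] &= \mathrm{E}_x\!\left[\frac{\mathrm{E}_{x'}[y' K_c(x, x')]^2}{\mathrm{E}[K_c^2]}\right]
\leq \mathrm{E}_x\!\left[\frac{\mathrm{E}_{x'}[y'^2]\, \mathrm{E}_{x'}[K_c^2(x, x')]}{\mathrm{E}[K_c^2]}\right] \\
&= \frac{\mathrm{E}_{x'}[y'^2]\, \mathrm{E}_{x, x'}[K_c^2(x, x')]}{\mathrm{E}[K_c^2]} = \mathrm{E}_{x'}[y'^2] = 1.
\end{aligned}
\]

Using again the fact that $\mathrm{E}_x[y h^*(x)] = \rho(K, K_Y)$, the error of $h^*$ can be bounded as follows:

\[
\mathrm{E}[(y - h^*(x))^2] = \mathrm{E}[h^*(x)^2] + \mathrm{E}[y^2] - 2\,\mathrm{E}[y h^*(x)] \leq 1 + 1 - 2\rho(K, K_Y),
\]

which concludes the proof. ■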

The hypothesis $h^*$ is closely related to the hypothesis $h_S$ derived as follows from a finite sample $S = ((x_1, y_1), \ldots, (x_m, y_m))$:

\[
h_S(x) = \frac{\frac{1}{m}\sum_{i=1}^m y_i K_c(x, x_i)}{\sqrt{\frac{1}{m^2}\sum_{i,j=1}^m K_c(x_i, x_j)^2 \; \frac{1}{m^2}\sum_{i,j=1}^m (y_i y_j)^2}}. \tag{12}
\]

Note in particular that $\hat{\mathrm{E}}_x[y h_S(x)] = \hat{\rho}(\mathbf{K}, \mathbf{K}_Y)$, where we denote by $\hat{\mathrm{E}}$ the expectation based on the empirical distribution. Using this and other results of this section, it is not hard to show that $|R(h^*) - R(h_S)| \leq O(1/\sqrt{m})$ both in the classification and regression settings.
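The empirical predictor (12) is simple to evaluate on the training sample. The sketch below does so with NumPy and checks numerically that the empirical mean of $y\, h_S(x)$ equals $\langle \mathbf{K}_c, \mathbf{y}\mathbf{y}^\top \rangle_F / (\|\mathbf{K}_c\|_F \|\mathbf{y}\mathbf{y}^\top\|_F)$, the ratio that the normalization in (12) produces. The function names and the toy data are assumptions; evaluating $h_S$ at a new point would additionally require centering the test kernel values, which is not shown.

```python
import numpy as np

def center_kernel(K):
    m = K.shape[0]
    H = np.eye(m) - np.ones((m, m)) / m
    return H @ K @ H

def h_S_on_sample(K, y):
    """Evaluate the predictor of equation (12) at the m training points."""
    m = len(y)
    Kc = center_kernel(K)
    numerator = Kc @ y / m  # (1/m) sum_i y_i K_c(x_j, x_i) for each j
    denominator = np.sqrt((np.sum(Kc ** 2) / m**2) * (np.sum(np.outer(y, y) ** 2) / m**2))
    return numerator / denominator

# Hypothetical binary classification sample with a linear kernel.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))
y = np.sign(X[:, 0])
K = X @ X.T

h = h_S_on_sample(K, y)
empirical_mean = np.mean(y * h)
Kc = center_kernel(K)
ratio = np.sum(Kc * np.outer(y, y)) / (np.linalg.norm(Kc) * np.linalg.norm(np.outer(y, y)))
```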
For classification, the existence of a good predictor $g^*$ based on the unnormalized alignment can also be shown. The corresponding guarantees are simpler and do not depend on a term such as $\Gamma$. However, unlike the normalized case, the loss of the predictor $g_S$ derived from a finite sample may not always be close to that of $g^*$. Note that in classification, for any label $y$, $|y| = 1$, thus, by Lemma 6, the following holds: $0 \leq \rho_u(K, K_Y) \leq R^2$. Let $g^*$ be the hypothesis defined by:

\[
g^*(x) = \mathrm{E}_{x'}[y' K_c(x, x')]. \tag{13}
\]

Since $0 \leq \rho_u(K, K_Y) \leq R^2$, the following theorem provides strong guarantees for $g^*$ when the unnormalized alignment $\rho_u$ is sufficiently large, that is close to $R^2$.

Theorem 15 (classification) Let $R(g^*) = \Pr[y g^*(x) < 0]$ denote the error of $g^*$ in binary classification. For any kernel $K$ such that $\sup_{x \in \mathcal{X}} K_c(x, x) \leq R^2$, we have:

\[
R(g^*) \leq 1 - \rho_u(K, K_Y)/R^2.
\]

Proof Note that for all $x \in \mathcal{X}$,

\[
|y g^*(x)| = |g^*(x)| = |\mathrm{E}_{x'}[y' K_c(x, x')]| \leq R^2.
\]

Using this inequality, and the fact that $\mathrm{E}_x[y g^*(x)] = \rho_u(K, K_Y)$, we can write:

\[
\begin{aligned}
1 - R(g^*) &= \Pr[y g^*(x) \geq 0] = \mathrm{E}\big[1_{\{y g^*(x) \geq 0\}}\big] \\
&\geq \mathrm{E}\Big[\frac{y g^*(x)}{R^2}\, 1_{\{y g^*(x) \geq 0\}}\Big] \\
&\geq \mathrm{E}\Big[\frac{y g^*(x)}{R^2}\Big] = \rho_u(K, K_Y)/R^2,
\end{aligned}
\]

which concludes the proof. ■



4.3 Generalization bounds for two-stage learning kernel algorithms 

This section presents stability-based generalization bounds for two-stage learning kernel algorithms. 

We present learning bounds for the case where the second stage of the algorithm is kernel ridge regression (KRR). Similar results can be given in classification using algorithms such as SVMs in the second stage. Thus, in the first stage, the algorithms we examine select a combination weight parameter $\mu \in \mathcal{M}_q = \{\mu : \mu \geq 0, \|\mu\|_q = \Lambda_q\}$ which defines a kernel $K_\mu$, and in the second stage use KRR to select a hypothesis from the RKHS associated to $K_\mu$. While several of our results hold in general, we will be more specifically interested in the alignment maximization algorithm presented in Section 3.2.2.

Recall that for a fixed kernel function $K_\mu$ with associated RKHS $\mathbb{H}_{K_\mu}$ and training set $S = ((x_1, y_1), \ldots, (x_m, y_m))$, the KRR optimization problem is defined by the following optimization problem:

\[
\min_{h \in \mathbb{H}_{K_\mu}} G(h) = \lambda_0 \|h\|_{K_\mu}^2 + \frac{1}{m}\sum_{i=1}^m (h(x_i) - y_i)^2. \tag{14}
\]
We first analyze the stability of two-stage algorithms and then use that to derive a stability-based generalization bound (Bousquet and Elisseeff, 2000). More precisely, we examine the pointwise difference in hypothesis values obtained on any point $x$ when the algorithm has been trained on two datasets $S$ and $S'$ of size $m$ that differ in exactly one point.

In what follows, we denote by $\|\mathbf{K}\|_{s,t} = \big(\sum_{k=1}^p \|\mathbf{K}_k\|_s^t\big)^{1/t}$ the $(s, t)$-norm of a collection of matrices and by $\Delta\mu$ the difference $\mu' - \mu$ of the combination vectors $\mu$ and $\mu'$ returned by the first stage of the algorithm when trained on $S$ and $S'$, respectively.

Theorem 16 (Stability of two-stage learning kernel algorithm) Let $S$ and $S'$ be two samples of size $m$ that differ in exactly one point and let $h$ and $h'$ be the associated hypotheses generated by a two-stage KRR learning kernel algorithm with the constraint $\mu \in \mathcal{M}_1$. Then, for any $s, t \geq 1$ with $\frac{1}{s} + \frac{1}{t} = 1$ and any $x \in \mathcal{X}$:

\[
|h'(x) - h(x)| \leq \frac{2\Lambda_1 R^2 M}{\lambda_0 m}\left[1 + \frac{\|\Delta\mu\|_s \|\mathbf{K}_c\|_{2,t}}{2\lambda_0}\right],
\]

where $M$ is an upper bound on the target labels and $R^2 = \sup_{k \in [1, p],\, x \in \mathcal{X}} K_k(x, x)$.

Proof The hypothesis returned by KRR can be parametrized by the kernel weight vector $\mu$, which defines the kernel function, and the sample $S$, which is used to populate the kernel matrix, and will be explicitly denoted $h_{\mu, S}$. To estimate the stability of the overall two-stage algorithm, $\Delta h_{\mu, S} = h_{\mu', S'} - h_{\mu, S}$, we use the decomposition

\[
\Delta h_{\mu, S} = (h_{\mu', S'} - h_{\mu', S}) + (h_{\mu', S} - h_{\mu, S})
\]

and bound each parenthesized term separately. The first parenthesized term measures the pointwise stability of KRR due to a change of a single training point with a fixed kernel. This can be bounded using Theorem 2 of Cortes et al. (2009a). Since, for all $x \in \mathcal{X}$, $K_\mu(x, x) = \sum_{k=1}^p \mu_k K_k(x, x) \leq R^2 \sum_{k=1}^p \mu_k \leq \Lambda_1 R^2$, using that theorem yields the following bound:

\[
\forall x \in \mathcal{X}, \quad |h_{\mu', S'}(x) - h_{\mu', S}(x)| \leq \frac{2\Lambda_1 R^2 M}{\lambda_0 m}.
\]

The second parenthesized term measures the pointwise difference of the hypotheses due to the change of kernel from $\mathbf{K}_{\mu'}$ to $\mathbf{K}_\mu$ for a fixed training sample when using KRR. By Proposition 1 of Cortes et al. (2010c), this term can be bounded as follows:

\[
\forall x \in \mathcal{X}, \quad |h_{\mu', S}(x) - h_{\mu, S}(x)| \leq \frac{\Lambda_1 R^2 M}{\lambda_0^2 m}\, \|\mathbf{K}_{\mu'} - \mathbf{K}_\mu\|. \tag{15}
\]

The term $\|\mathbf{K}_{\mu'} - \mathbf{K}_\mu\|$ can be bounded using Hölder's inequality as follows:

\[
\|\mathbf{K}_{\mu'} - \mathbf{K}_\mu\| = \Big\|\sum_{k=1}^p (\Delta\mu_k)\mathbf{K}_k\Big\| \leq \sum_{k=1}^p |\Delta\mu_k|\, \|\mathbf{K}_k\| \leq \|\Delta\mu\|_s \|\mathbf{K}\|_{2,t},
\]

which completes the proof. ■
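The last step of the proof can be checked numerically for several conjugate pairs $(s, t)$. The sketch below assumes the matrix norm is the spectral norm (the chain of inequalities holds for any matrix norm); the function names and the random base kernels are illustrative assumptions.

```python
import numpy as np

def collection_norm(kernels, t):
    """(2, t)-norm of a collection: (sum_k ||K_k||_2^t)^(1/t), with the max for t = inf."""
    spectral = np.array([np.linalg.norm(K, 2) for K in kernels])
    if np.isinf(t):
        return spectral.max()
    return np.sum(spectral ** t) ** (1.0 / t)

def kernel_change_bound(delta_mu, kernels, s, t):
    """Return (||sum_k dmu_k K_k||_2, ||dmu||_s ||K||_{2,t}) for conjugate exponents s, t."""
    lhs = np.linalg.norm(sum(d * K for d, K in zip(delta_mu, kernels)), 2)
    rhs = np.linalg.norm(delta_mu, ord=s) * collection_norm(kernels, t)
    return lhs, rhs

# Hypothetical trace-normalized PSD base kernels and a random weight perturbation.
rng = np.random.default_rng(0)
kernels = []
for _ in range(4):
    A = rng.normal(size=(15, 5))
    K = A @ A.T
    kernels.append(K / np.trace(K))
delta_mu = rng.normal(size=4)

pairs = [(1.0, np.inf), (2.0, 2.0), (3.0, 1.5), (np.inf, 1.0)]
results = [kernel_change_bound(delta_mu, kernels, s, t) for s, t in pairs]
```

Each pair in `results` has its first entry no larger than its second; the tightness of the bound depends on the choice of $(s, t)$ and on how the mass of $\Delta\mu$ is distributed across kernels.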

The pointwise stability result just presented can be used directly to derive a generalization bound 
for two-stage learning kernel algorithms as in (Bousquet and Elisseeff, 2000). 
For a hypothesis $h$, we denote by $R(h)$ its generalization error and by $\hat{R}(h)$ its empirical error on a sample $S = ((x_1, y_1), \ldots, (x_m, y_m))$:

\[
R(h) = \mathrm{E}_{x, y}\big[(h(x) - y)^2\big], \qquad \hat{R}(h) = \frac{1}{m}\sum_{i=1}^m (h(x_i) - y_i)^2.
\]

Theorem 17 (Stability-based generalization bound) Let $h_S$ denote the hypothesis returned by a two-stage KRR kernel learning algorithm with the constraint $\mu \in \mathcal{M}_1$ when trained on sample $S$. For any $s, t \geq 1$ with $\frac{1}{s} + \frac{1}{t} = 1$, with probability at least $1 - \delta$ over samples $S$ of size $m$, the following bound holds:

\[
R(h_S) \leq \hat{R}(h_S) + \frac{2 M_1 M_2}{m} + \left(1 + \frac{16 M_2}{M_1}\right) \frac{M_1^2}{4} \sqrt{\frac{\log\frac{1}{\delta}}{2m}},
\]

with $M_1 = 2\Big[1 + R\sqrt{\frac{\Lambda_1}{\lambda_0}}\Big] M$ and $M_2 = \frac{2\Lambda_1 R^2 M}{\lambda_0}\Big[1 + \frac{\|\Delta\mu\|_s \|\mathbf{K}_c\|_{2,t}}{2\lambda_0}\Big]$.

Proof Since $h_S$ is the minimizer of the objective (14) and since $0$ belongs to the hypothesis space,

\[
G(h_S) \leq G(0) = \frac{1}{m}\sum_{i=1}^m (0 - y_i)^2 \leq M^2.
\]

Furthermore, since the mean squared loss is non-negative, we can write: $\lambda_0 \|h_S\|_{K_\mu}^2 \leq G(h_S)$. Therefore, $\|h_S\|_{K_\mu} \leq \frac{M}{\sqrt{\lambda_0}}$. By the reproducing property, for any $x \in \mathcal{X}$,

\[
|h_S(x)| = |\langle h_S, K_\mu(x, \cdot) \rangle_{K_\mu}| \leq \|h_S\|_{K_\mu} \sqrt{K_\mu(x, x)}
= \|h_S\|_{K_\mu} \sqrt{\sum_{k=1}^p \mu_k K_k(x, x)} \leq R M \sqrt{\frac{\Lambda_1}{\lambda_0}}.
\]

Thus, for all $x \in \mathcal{X}$ and $y \in [-M, M]$, the squared loss can be bounded as follows

\[
(h_S(x) - y)^2 \leq \left(M + R M \sqrt{\frac{\Lambda_1}{\lambda_0}}\right)^2 = \frac{M_1^2}{4}.
\]

This implies that the squared loss is $M_1$-Lipschitz and, by Theorem 16, that the algorithm is stable with a uniform stability parameter $\beta$ bounded as follows:

\[
|(h_{S'}(x) - y)^2 - (h_S(x) - y)^2| \leq M_1 |h_{S'}(x) - h_S(x)| \leq \frac{M_1 M_2}{m}.
\]

The application of Theorem 12 of (Bousquet and Elisseeff, 2000) with the bound on the loss $\frac{M_1^2}{4}$ and the uniform stability parameter $\beta$ directly yields the statement. ■

The inequality just presented holds for all two-stage learning kernel algorithms. To determine its convergence rate, the term $\|\Delta\mu\|_s \|\mathbf{K}_c\|_{2,t}$ must be bounded. Let $s = 1$ and $t = \infty$, and assume that the base kernels $\mathbf{K}_k$, $k \in [1, p]$, are trace-normalized as in our experiments (Section 5), then a straightforward bound can be given for this term:

\[
\|\Delta\mu\|_1 \|\mathbf{K}_c\|_{2,\infty} \leq (\|\mu'\|_1 + \|\mu\|_1) \max_{k \in [1, p]} \|\mathbf{K}_{kc}\|_2 \leq \max_{k \in [1, p]} 2\Lambda_1 \mathrm{Tr}[\mathbf{K}_{kc}] \leq 2\Lambda_1.
\]

Thus, in the statement of Theorem 17, $M_2$ can be replaced with $\frac{2\Lambda_1 R^2 M}{\lambda_0}\Big[1 + \frac{\Lambda_1}{\lambda_0}\Big]$ and, for $\Lambda_1$ and $\lambda_0$ constant, the learning bound converges in $O(1/\sqrt{m})$.

The straightforward upper bound on $\|\Delta\mu\|_s \|\mathbf{K}_c\|_{2,t}$ applies to all such two-stage learning kernel algorithms. For a specific algorithm, finer or more favorable bounds could be derived. We have initiated this study in the specific case of the alignment maximization algorithm. The result given in Proposition 21 (Appendix B) can be used to bound $\|\Delta\mu\|_2$ and thus $\|\Delta\mu\|_2 \|\mathbf{K}_c\|_{2,2}$.

Note that in the specific case of the alignment maximization algorithm, if $\mu^*$ is the solution obtained for the constraint $\mu \in \mathcal{M}_2$, then it is also the alignment maximizing solution found in the set $\mu \in \mathcal{M}_1$ with $\Lambda_1 = \|\mu^*\|_1 \leq \sqrt{p}\,\|\mu^*\|_2 \leq \sqrt{p}\,\Lambda_2$. This makes the dependency on $p$ explicit in the case of a constraint $\mu \in \mathcal{M}_2$.

5. Experiments 

This section compares the performance of several learning kernel algorithms for classification and regression. We compare the alignment-based two-stage learning kernel algorithms align and alignf, as well as the single-stage algorithm presented in Section 3, with the following:

Uniform combination (unif): this is the most straightforward method, which consists of choosing equal mixture weights, thus the kernel matrix used is

\[
\mathbf{K}_\mu = \frac{\Lambda}{p}\sum_{k=1}^p \mathbf{K}_k. \tag{16}
\]

Nevertheless, improving upon the performance of this method has been surprisingly difficult for standard (one-stage) learning kernel algorithms (Cortes, 2009).

Norm-1 regularized combination (l1-svm): this algorithm optimizes the SVM objective

\[
\begin{aligned}
\min_{\mu} \max_{\alpha} \quad & 2\alpha^\top \mathbf{1} - \alpha^\top \mathbf{Y}^\top \mathbf{K}_\mu \mathbf{Y} \alpha \\
\text{subject to:} \quad & \mu \geq 0,\; \mathrm{Tr}[\mathbf{K}_\mu] \leq \Lambda,\; \alpha^\top \mathbf{y} = 0,\; 0 \leq \alpha \leq C,
\end{aligned}
\]

as described by Lanckriet et al. (2004). Here, $\mathbf{Y}$ is the diagonal matrix constructed from the labels $\mathbf{y}$ and $C$ is the regularization parameter of the SVM.

Norm-2 regularized combination (l2-krr): this algorithm optimizes the kernel ridge regression objective

\[
\begin{aligned}
\min_{\mu} \max_{\alpha} \quad & -\lambda \alpha^\top \alpha - \alpha^\top \mathbf{K}_\mu \alpha + 2\alpha^\top \mathbf{y} \\
\text{subject to:} \quad & \mu \geq 0,\; \|\mu - \mu_0\|_2 \leq \Lambda,
\end{aligned}
\]

as described in Cortes et al. (2009a). Here, $\lambda$ is the regularization parameter of KRR, and $\mu_0$ is an additional regularization parameter for the kernel selection.

| | KINEMATICS | IONOSPHERE | GERMAN | SPAMBASE | SPLICE |
|---|---|---|---|---|---|
| Task | Regression | Regression | Classification | Classification | Classification |
| m | 1000 | 351 | 1000 | 1000 | 1000 |
| γ | -3, 3 | -3, 3 | -4, 3 | -12, -7 | -9, -3 |
| unif (error) | .138 ± .005 | .467 ± .085 | 25.9 ± 1.8 | 18.7 ± 2.8 | 15.2 ± 2.2 |
| unif (alignment) | .158 ± .013 | .242 ± .021 | .089 ± .008 | .138 ± .031 | .122 ± .011 |
| 1-stage (error) | .137 ± .005 | .457 ± .085 | 26.0 ± 2.6 | 20.9 ± 2.80 | 15.3 ± 2.5 |
| 1-stage (alignment) | .155 ± .012 | .248 ± .022 | .082 ± .003 | .099 ± .024 | .105 ± .006 |
| align (error) | .125 ± .004 | .445 ± .086 | 25.5 ± 1.5 | 18.6 ± 2.6 | 15.1 ± 2.4 |
| align (alignment) | .173 ± .016 | .257 ± .024 | .089 ± .008 | .140 ± .031 | .123 ± .011 |
| alignf (error) | .115 ± .004 | .442 ± .087 | 24.2 ± 1.5 | 18.0 ± 2.4 | 13.9 ± 1.3 |
| alignf (alignment) | .176 ± .017 | .273 ± .030 | .093 ± .009 | .146 ± .028 | .124 ± .011 |

Table 2: Error measures and alignment values for unif, 1-stage (l2-krr or l1-svm), align and alignf with kernels built from linear combinations of Gaussian base kernels. The choice of $\gamma_0, \gamma_1$ is listed in the row labeled γ, and m is the size of the dataset used. Shown with ±1 standard deviation measured by 5-fold cross-validation.

In all experiments, the error measures reported are for 5-fold cross validation, where, in each trial, three folds are used for training, one used for validation, and one for testing. For the two-stage methods, the same training and validation data is used for both stages of the learning. The regularization parameter $\Lambda$ is chosen via a grid search based on the performance on the validation set, while the regularization parameters $C$ and $\lambda$ are fixed since only the ratios $C/\Lambda$ and $\lambda/\Lambda$ are important. More explicitly, for the KRR algorithm, scaling the vector $\mu$ by $\Lambda$ results in a scaled dual solution: $\alpha_\Lambda = (\Lambda\mathbf{K}_\mu + \lambda \mathbf{I})^{-1}\mathbf{y} = \Lambda^{-1}(\mathbf{K}_\mu + \frac{\lambda}{\Lambda}\mathbf{I})^{-1}\mathbf{y} = \Lambda^{-1}\alpha$, where $\alpha = (\mathbf{K}_\mu + \frac{\lambda}{\Lambda}\mathbf{I})^{-1}\mathbf{y}$. In turn, we see that the primal solution $h(x) = \sum_{i=1}^m \Lambda^{-1}\alpha_i \Lambda K_\mu(x, x_i) = \sum_{i=1}^m \alpha_i K_\mu(x, x_i)$ is equivalent to the solution of the KRR algorithm that uses a regularization parameter equal to $\lambda/\Lambda$ without scaling $\mu$ and, thus, it suffices to vary only one regularization parameter. In the case of SVMs, the scale of the hypothesis does not change its sign (or the binary prediction) and thus the same property can be shown to hold. The $\mu_0$ parameter is set to zero in Section 5.1, and is chosen to be uniform in Section 5.2.
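The scaling argument for KRR can be verified numerically in a few lines. The sketch below uses a random linear kernel and hand-picked values of $\lambda$ and $\Lambda$ (named `lam` and `Lam`), all of which are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(25, 4))
y = rng.normal(size=25)
K = X @ X.T                 # plays the role of K_mu for a fixed mu
lam, Lam = 0.5, 8.0         # lambda (ridge parameter) and Lambda (scale of mu)
I = np.eye(len(y))

# KRR with the scaled kernel Lam * K and ridge parameter lam.
alpha_scaled = np.linalg.solve(Lam * K + lam * I, y)
pred_scaled = (Lam * K) @ alpha_scaled

# KRR with the unscaled kernel K and ridge parameter lam / Lam.
alpha_ref = np.linalg.solve(K + (lam / Lam) * I, y)
pred_ref = K @ alpha_ref
```

The dual solutions differ by the factor $\Lambda^{-1}$ while the predictions on the sample coincide, which is why only one of the two parameters needs to be tuned.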

5.1 General kernel combinations 

In the first set of experiments, we consider combinations of Gaussian kernels of the form $K_\gamma(x_i, x_j) = \exp(-\gamma\|x_i - x_j\|^2)$, with varying bandwidth parameter $\gamma \in \{2^{\gamma_0}, 2^{\gamma_0 + 1}, \ldots, 2^{\gamma_1 - 1}, 2^{\gamma_1}\}$. The values $\gamma_0$ and $\gamma_1$ are chosen such that the base kernels are sufficiently different in alignment and performance. Each base kernel is centered and normalized to have trace equal to one. We test the algorithms on several datasets taken from the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/) and Delve (http://www.cs.toronto.edu/~delve/data/datasets.html).
Table 2 summarizes our results. For classification, we compare against the l1-svm method and report the misclassification percentage. For regression, we compare against the l2-krr method and report RMSE. In general, we see that performance and alignment are well correlated. In all datasets, we see improvement over the uniform combination as well as the one-stage kernel learning algorithms. Note that although the align method often increases the alignment of the final kernel, as compared to the uniform combination, the alignf method gives the best alignment since it directly maximizes this quantity. Nonetheless, align provides an inexpensive heuristic that increases the alignment and performance of the final combination kernel.

Figure 3: A scatter plot comparison of the different kernel combination weight values obtained by optimally tuned one-stage and two-stage algorithms on the kinematics dataset (axis label: two-stage $\mu$).

In our experiments with the one-stage KRR algorithm presented in Section 3.4, there was no significant improvement found over the two-stage alignf algorithm with respect to the kinematics and ionosphere datasets. In fact, for optimally cross-validated parameters $\gamma$, $\gamma'$ and $\gamma''$ the solution combination weights were found to closely coincide with the alignf solution (see Figure 3). This would suggest the use of the two-stage algorithm over the one-stage, since there are fewer parameters to tune and the problem can be solved as a standard QP.

To the best of our knowledge, these are the first kernel combination experiments for alignment with general base kernels. Previous experiments seem to have dealt exclusively with rank-1 base kernels built from the eigenvectors of a single kernel matrix (Cristianini et al., 2001). In the next section, we also examine rank-1 kernels, although not generated from a spectral decomposition.



5.2 Rank-1 kernel combinations 

In this set of experiments we use the sentiment analysis dataset version 1 from Blitzer et al. (2007): books, dvd, electronics and kitchen. Each domain has 2,000 examples. In the regression setting, the goal is to predict a rating between 1 and 5, while for classification the goal is to discriminate positive (ratings ≥ 4) from negative reviews (ratings ≤ 2). We use rank-1 kernels based on the 4,000 most frequent bigrams. The $k$th base kernel, $\mathbf{K}_k$, corresponds to the $k$th bigram count $\mathbf{v}_k$, $\mathbf{K}_k = \mathbf{v}_k \mathbf{v}_k^\top$. Each base kernel is normalized to have trace 1 and the labels are centered.
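With rank-1 base kernels, the quantities $\mathbf{a}$ and $\mathbf{M}$ used by the alignment-based algorithms can be computed without forming any $m \times m$ matrix, since $\langle \mathbf{v}\mathbf{v}^\top, \mathbf{y}\mathbf{y}^\top \rangle_F = (\mathbf{v}^\top \mathbf{y})^2$ and $\langle \mathbf{v}_k\mathbf{v}_k^\top, \mathbf{v}_l\mathbf{v}_l^\top \rangle_F = (\mathbf{v}_k^\top \mathbf{v}_l)^2$. The sketch below illustrates this; the function name and the random count matrix are assumptions, and centering the feature columns (equivalent to centering each rank-1 kernel) is an assumption about the preprocessing.

```python
import numpy as np

def rank1_alignment_terms(V, y):
    """Compute a and M for base kernels K_k = v_k v_k^T, one column of V per feature.

    Columns are centered and scaled to unit norm, so each kernel is centered with trace one.
    Columns must not be constant, otherwise the normalization divides by zero.
    """
    Vc = V - V.mean(axis=0)                # centering v_k centers the kernel v_k v_k^T
    Vn = Vc / np.linalg.norm(Vc, axis=0)   # unit columns give trace-one kernels
    a = (Vn.T @ y) ** 2                    # a_k = (v_k^T y)^2
    M = (Vn.T @ Vn) ** 2                   # M_kl = (v_k^T v_l)^2
    return a, M

# Hypothetical stand-in for a small bigram count matrix with centered ratings.
rng = np.random.default_rng(0)
V = rng.integers(0, 5, size=(20, 5)).astype(float)
y = rng.normal(size=20)
y = y - y.mean()
a, M = rank1_alignment_terms(V, y)
```

The cost is $O(mp^2)$ for $p$ features instead of $O(m^2p^2)$ when the kernel matrices are materialized, which matters with 4,000 base kernels.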

The alignf method returns a sparse weight vector due to the constraint $\boldsymbol{\mu} \ge 0$. As is demonstrated
by the performance of the l1-svm method (Table 3), and as previously observed by Cortes
et al. (2009a), a sparse weight vector $\boldsymbol{\mu}$ does not generally offer an improvement over the uniform
combination in the rank-1 setting. Thus, we focus on the performance of align and compare it
to unif and one-stage learning methods. Table 3 shows that align significantly improves both
the alignment and the error percentage over unif and also improves somewhat over the one-stage
l2-krr algorithm. Although the sparse weighting provided by l1-svm improves the alignment in
certain cases, it does not improve performance.

Regression

            BOOKS           DVD             ELEC            KITCHEN
unif        1.442 ± .015    1.438 ± .033    1.342 ± .030    1.356 ± .016
            .029 ± .005     .029 ± .005     .038 ± .002     .039 ± .006
l2-krr      1.414 ± .020    1.420 ± .034    1.318 ± .031    1.332 ± .016
            .031 ± .004     .031 ± .005     .042 ± .003     .044 ± .007
align       1.401 ± .035    1.414 ± .017    1.308 ± .033    1.312 ± .012
            .046 ± .006     .047 ± .005     .065 ± .004     .076 ± .008

Classification

            BOOKS           DVD             ELEC            KITCHEN
unif        25.8 ± 1.7      24.3 ± 1.5      18.8 ± 1.4      20.1 ± 2.0
            .030 ± .004     .030 ± .005     .040 ± .002     .039 ± .007
l1-svm      28.6 ± 1.6      29.0 ± 2.2      23.8 ± 1.9      23.8 ± 2.2
            .029 ± .012     .038 ± .011     .051 ± .004     .060 ± .006
align       24.3 ± 2.0      21.4 ± 2.0      16.6 ± 1.6      17.2 ± 2.2
            .043 ± .003     .045 ± .005     .063 ± .004     .070 ± .010

Table 3: The error measures (top line of each row) and alignment values (bottom line of each row)
for kernels built with rank-1 feature-based kernels on four sentiment analysis domains. Shown
with ±1 standard deviation as measured by 5-fold cross-validation.
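
To make the comparison between a uniform and an alignment-weighted combination concrete, the following sketch computes the centered alignment between a combined kernel matrix and the target kernel $\mathbf{y}\mathbf{y}^\top$. It rests on assumptions that are not restated in this section: centered alignment is taken to be the normalized Frobenius product of the centered matrices, and the align weights are taken to be proportional to the individual alignments of the base kernels with the target. The synthetic data and all function names are hypothetical.

```python
import numpy as np

def center(K):
    """Center a kernel matrix: K_c = (I - 11^T/m) K (I - 11^T/m)."""
    m = K.shape[0]
    H = np.eye(m) - np.ones((m, m)) / m
    return H @ K @ H

def centered_alignment(K1, K2):
    """Normalized Frobenius product of the centered matrices (assumed definition)."""
    K1c, K2c = center(K1), center(K2)
    denom = np.linalg.norm(K1c) * np.linalg.norm(K2c)   # Frobenius norms
    return float(np.sum(K1c * K2c) / denom) if denom > 0 else 0.0

def align_weights(kernels, y):
    """Weights proportional to each base kernel's alignment with yy^T (assumption)."""
    KY = np.outer(y, y)
    rho = np.array([centered_alignment(K, KY) for K in kernels])
    norm = np.linalg.norm(rho)
    return rho / norm if norm > 0 else rho

# Synthetic data: only the first feature carries signal (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 5))
y = X[:, 0] + 0.1 * rng.normal(size=40)
y = y - y.mean()
kernels = [np.outer(X[:, k], X[:, k]) / (X[:, k] @ X[:, k]) for k in range(5)]

mu = align_weights(kernels, y)
KY = np.outer(y, y)
rho_unif = centered_alignment(sum(kernels) / len(kernels), KY)
rho_align = centered_alignment(sum(w * K for w, K in zip(mu, kernels)), KY)
print("unif alignment:", rho_unif, "align alignment:", rho_align)
```

The two printed values play the role of the alignment entries in Table 3: one for the uniform combination and one for the alignment-weighted combination.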

6. Conclusion 

We presented a series of novel algorithmic, theoretical, and empirical results for learning kernels 
based on the notion of centered alignment. Our experiments show a consistent improvement of the 
performance of alignment-based algorithms over previous learning kernel techniques, as well as 
the straightforward uniform kernel combination, which has been difficult to surpass in the past, in 
both classification and regression. The algorithms we described are efficient and easy to implement. 
They can be used in a variety of applications to improve performance. We also gave an extensive 
theoretical analysis which provides a number of guarantees for centered alignment-based algorithms 
and methods. Several of the algorithmic and theoretical results presented can be extended to other 
learning settings. In particular, methods based on similar ideas could be used to design learning 
kernel algorithms for dimensionality reduction. 

The notion of centered alignment served as a key similarity measure to achieve these results. 
Different methods based on possibly different efficiently computable similarity measures could be 
used to design effective learning kernel algorithms. In particular, the notion of similarity suggested 
by Balcan and Blum (2006), if it could be computed from finite samples, could be used in an equivalent
way.
Appendix A. Lemmas supporting proof of Proposition 11

For a function $f$ of the sample $S$, we denote by $\Delta(f)$ the difference $f(S') - f(S)$, where $S'$ is
a sample differing from $S$ by just one point, say the $m$-th point is $x_m$ in $S$ and $x'_m$ in $S'$. The
following perturbation bound will be needed in order to apply McDiarmid's inequality.



Lemma 18 Let $\mathbf{K}$ and $\mathbf{K}'$ denote kernel matrices associated to the kernel functions $K$ and $K'$ for
a sample of size $m$ drawn according to the distribution $D$. Assume that for any $x \in \mathcal{X}$, $K(x, x) \le R^2$ and
$K'(x, x) \le R'^2$. Then, the following perturbation inequality holds when changing one point of the
sample:

$$\frac{1}{m^2} \big| \Delta(\langle \mathbf{K}_c, \mathbf{K}'_c \rangle_F) \big| \le \frac{24 R^2 R'^2}{m}.$$



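Proof By Lemma 1, we can write:

$$
\begin{aligned}
\langle \mathbf{K}_c, \mathbf{K}'_c \rangle_F = \langle \mathbf{K}_c, \mathbf{K}' \rangle_F
&= \mathrm{Tr}\left[ \left( \mathbf{I} - \frac{\mathbf{1}\mathbf{1}^\top}{m} \right) \mathbf{K} \left( \mathbf{I} - \frac{\mathbf{1}\mathbf{1}^\top}{m} \right) \mathbf{K}' \right] \\
&= \mathrm{Tr}\left[ \mathbf{K}\mathbf{K}' - \frac{\mathbf{1}\mathbf{1}^\top}{m} \mathbf{K}\mathbf{K}' - \mathbf{K} \frac{\mathbf{1}\mathbf{1}^\top}{m} \mathbf{K}' + \frac{\mathbf{1}\mathbf{1}^\top}{m} \mathbf{K} \frac{\mathbf{1}\mathbf{1}^\top}{m} \mathbf{K}' \right] \\
&= \langle \mathbf{K}, \mathbf{K}' \rangle_F - \frac{\mathbf{1}^\top (\mathbf{K}\mathbf{K}' + \mathbf{K}'\mathbf{K}) \mathbf{1}}{m} + \frac{(\mathbf{1}^\top \mathbf{K} \mathbf{1})(\mathbf{1}^\top \mathbf{K}' \mathbf{1})}{m^2}.
\end{aligned}
$$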



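The perturbation of the first term is given by

$$\Delta(\langle \mathbf{K}, \mathbf{K}' \rangle_F) = \sum_{i=1}^{m} \Delta(\mathbf{K}_{im} \mathbf{K}'_{im}) + \sum_{i \neq m} \Delta(\mathbf{K}_{mi} \mathbf{K}'_{mi}).$$

By the Cauchy-Schwarz inequality, for any $i, j \in [1, m]$, $|\mathbf{K}_{ij}| = |K(x_i, x_j)| \le \sqrt{K(x_i, x_i) K(x_j, x_j)} \le R^2$,
and the product can be bounded as $|\mathbf{K}_{ij} \mathbf{K}'_{ij}| \le |\mathbf{K}_{ij}| \, |\mathbf{K}'_{ij}| \le R^2 R'^2$. The difference of products
is then bounded as $|\Delta(\mathbf{K}_{ij} \mathbf{K}'_{ij})| \le 2 R^2 R'^2$. Thus,

$$\frac{1}{m^2} \big| \Delta(\langle \mathbf{K}, \mathbf{K}' \rangle_F) \big| \le \frac{2m - 1}{m^2} (2 R^2 R'^2) \le \frac{4 R^2 R'^2}{m}.$$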
Similarly, for the first part of the second term, we obtain

$$
\frac{1}{m^2} \left| \Delta\!\left( \frac{\mathbf{1}^\top \mathbf{K}\mathbf{K}' \mathbf{1}}{m} \right) \right|
= \left| \Delta\!\left( \frac{\sum_{i,j,k=1}^{m} \mathbf{K}_{ik} \mathbf{K}'_{kj}}{m^3} \right) \right|
\le \frac{m^2 + m(m-1) + (m-1)^2}{m^3} (2 R^2 R'^2)
\le \frac{3m^2 - 3m + 1}{m^3} (2 R^2 R'^2)
\le \frac{6 R^2 R'^2}{m}.
$$



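Similarly, we have:

$$\frac{1}{m^2} \left| \Delta\!\left( \frac{\mathbf{1}^\top \mathbf{K}'\mathbf{K} \mathbf{1}}{m} \right) \right| \le \frac{6 R^2 R'^2}{m}.$$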



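The final term is bounded as follows:

$$
\begin{aligned}
\frac{1}{m^2} \left| \Delta\!\left( \frac{(\mathbf{1}^\top \mathbf{K} \mathbf{1})(\mathbf{1}^\top \mathbf{K}' \mathbf{1})}{m^2} \right) \right|
&\le \left| \Delta\!\left( \frac{\sum_{i,j,k} \mathbf{K}_{ij} \mathbf{K}'_{km} + \sum_{i,j,k \neq m} \mathbf{K}_{ij} \mathbf{K}'_{mk}}{m^4} \right) \right| \\
&\quad + \left| \Delta\!\left( \frac{\sum_{i, j \neq m, k \neq m} \mathbf{K}_{im} \mathbf{K}'_{jk} + \sum_{i \neq m, j \neq m, k \neq m} \mathbf{K}_{mi} \mathbf{K}'_{jk}}{m^4} \right) \right| \\
&\le \frac{m^3 + m^2(m-1) + m(m-1)^2 + (m-1)^3}{m^4} (2 R^2 R'^2) \\
&\le \frac{8 R^2 R'^2}{m}.
\end{aligned}
$$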



Combining these last four inequalities leads directly to the statement of the lemma. 
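
The lemma can be checked numerically on a small example. The sketch below is only an illustration under stated assumptions: it uses two Gaussian kernels (so that $K(x,x) = K'(x,x) = 1$ and $R^2 = R'^2 = 1$), a synthetic sample, and hypothetical helper names. It verifies both the three-term expansion of $\langle \mathbf{K}_c, \mathbf{K}'_c \rangle_F$ used in the proof and the perturbation bound $24 R^2 R'^2 / m$ when the $m$-th point is replaced.

```python
import numpy as np

def gaussian_gram(X, gamma):
    """Gram matrix of exp(-gamma * ||x - x'||^2); its diagonal is 1, so R^2 = 1."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-gamma * sq)

def centered_inner(K, Kp):
    """<K_c, K'_c>_F computed by explicit centering."""
    m = K.shape[0]
    H = np.eye(m) - np.ones((m, m)) / m
    return np.sum((H @ K @ H) * (H @ Kp @ H))

def expanded_inner(K, Kp):
    """The same quantity via the three-term expansion from the proof."""
    m = K.shape[0]
    one = np.ones(m)
    return (np.sum(K * Kp)
            - one @ (K @ Kp + Kp @ K) @ one / m
            + (one @ K @ one) * (one @ Kp @ one) / m**2)

rng = np.random.default_rng(1)
m = 50
X = rng.normal(size=(m, 2))            # sample S
X_prime = X.copy()
X_prime[-1] = 10.0 * rng.normal(size=2)  # sample S': only the m-th point changes

K, Kp = gaussian_gram(X, 0.5), gaussian_gram(X, 2.0)
K2, Kp2 = gaussian_gram(X_prime, 0.5), gaussian_gram(X_prime, 2.0)

identity_gap = abs(centered_inner(K, Kp) - expanded_inner(K, Kp))
delta = abs(centered_inner(K2, Kp2) - centered_inner(K, Kp)) / m**2
bound = 24.0 / m                       # 24 R^2 R'^2 / m with R^2 = R'^2 = 1
print(identity_gap, delta, bound)
```

On such examples the observed change is typically far below the bound, which is expected since the constant 24 comes from worst-case term counting.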



Because of the diagonal terms of the matrices, $\frac{1}{m^2} \langle \mathbf{K}_c, \mathbf{K}'_c \rangle_F$ is not an unbiased estimate of
$\mathrm{E}[K_c K'_c]$. However, as shown by the following lemma, the estimation bias decreases at the rate
$O(1/m)$.

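Lemma 19 Under the same assumptions as Lemma 18, the following bound on the difference of
expectations holds:

$$\left| \mathrm{E}_{x, x'}\big[ K_c(x, x') K'_c(x, x') \big] - \mathrm{E}_S\!\left[ \frac{\langle \mathbf{K}_c, \mathbf{K}'_c \rangle_F}{m^2} \right] \right| \le \frac{18 R^2 R'^2}{m}.$$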



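Proof To simplify the notation, unless otherwise specified, the expectation is taken over $x, x'$ drawn
according to the distribution $D$.

The key observation used in this proof is that

$$\mathrm{E}_S[\mathbf{K}_{ij} \mathbf{K}'_{ij}] = \mathrm{E}_S[K(x_i, x_j) K'(x_i, x_j)] = \mathrm{E}[K K'] \tag{17}$$

for $i, j$ distinct. For expressions such as $\mathrm{E}_S[\mathbf{K}_{ik} \mathbf{K}'_{kj}]$ with $i, j, k$ distinct, we obtain the following:

$$\mathrm{E}_S[\mathbf{K}_{ik} \mathbf{K}'_{kj}] = \mathrm{E}_S[K(x_i, x_k) K'(x_k, x_j)] = \mathrm{E}_{x}\big[ \mathrm{E}_{x'}[K] \, \mathrm{E}_{x'}[K'] \big]. \tag{18}$$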
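Let us start with the expression of $\mathrm{E}[K_c K'_c]$:

$$\mathrm{E}[K_c K'_c] = \mathrm{E}\Big[ \big( K - \mathrm{E}_{x'}[K] - \mathrm{E}_{x}[K] + \mathrm{E}[K] \big) \big( K' - \mathrm{E}_{x'}[K'] - \mathrm{E}_{x}[K'] + \mathrm{E}[K'] \big) \Big]. \tag{19}$$

After expanding this expression, applying the expectation to each of the terms, and simplifying, we
obtain:

$$\mathrm{E}[K_c K'_c] = \mathrm{E}[K K'] - 2\, \mathrm{E}_{x}\big[ \mathrm{E}_{x'}[K] \, \mathrm{E}_{x'}[K'] \big] + \mathrm{E}[K] \, \mathrm{E}[K'].$$

$\langle \mathbf{K}_c, \mathbf{K}'_c \rangle_F$ can be expanded and written more explicitly as follows:

$$
\begin{aligned}
\langle \mathbf{K}_c, \mathbf{K}'_c \rangle_F
&= \langle \mathbf{K}, \mathbf{K}' \rangle_F - \frac{\mathbf{1}^\top \mathbf{K}\mathbf{K}' \mathbf{1}}{m} - \frac{\mathbf{1}^\top \mathbf{K}'\mathbf{K} \mathbf{1}}{m} + \frac{\mathbf{1}^\top \mathbf{K}' \mathbf{1} \, \mathbf{1}^\top \mathbf{K} \mathbf{1}}{m^2} \\
&= \sum_{i,j=1}^{m} \mathbf{K}_{ij} \mathbf{K}'_{ij} - \frac{1}{m} \sum_{i,j,k=1}^{m} \big( \mathbf{K}_{ik} \mathbf{K}'_{kj} + \mathbf{K}'_{ik} \mathbf{K}_{kj} \big) + \frac{1}{m^2} \Big( \sum_{i,j=1}^{m} \mathbf{K}_{ij} \Big) \Big( \sum_{i,j=1}^{m} \mathbf{K}'_{ij} \Big).
\end{aligned}
$$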



To take the expectation of this expression, we shall use the observations (17) and (18) and similar
identities. Counting terms of each kind leads to the following expression of the expectation:

$$\cdots \; \mathrm{E}[K] \, \mathrm{E}[K'(x, x)].$$



Taking the difference with the expression of $\mathrm{E}[K_c K'_c]$ (Equation 19), using the fact that terms of
the form $\mathrm{E}_x[K(x, x) K'(x, x)]$ and other similar ones are all bounded by $R^2 R'^2$, and collecting the terms
gives

$$\cdots \; \le \; \frac{m - 1}{\cdots} \, R^2 R'^2.$$

Using again the fact that the expectations are bounded by $R^2 R'^2$ yields

$$\left| \mathrm{E}[K_c K'_c] - \mathrm{E}_S\!\left[ \frac{\langle \mathbf{K}_c, \mathbf{K}'_c \rangle_F}{m^2} \right] \right| \le \frac{18 R^2 R'^2}{m}$$

and concludes the proof.
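
The bias discussed in Lemma 19 can be computed exactly in a toy setting. The sketch below assumes a distribution $D$ that is uniform over three points on the real line and two Gaussian kernels (so $R^2 = R'^2 = 1$); these choices and all names are hypothetical and serve only as an illustration. Because $D$ has finite support, $\mathrm{E}[K_c K'_c]$ is obtained by centering the Gram matrix on the support, and the expectation over samples $S$ of size $m$ is obtained by enumerating all $3^m$ samples.

```python
import itertools
import numpy as np

def centered_inner(K, Kp):
    """<K_c, K'_c>_F / m^2 for two m x m kernel matrices."""
    m = K.shape[0]
    H = np.eye(m) - np.ones((m, m)) / m
    return np.sum((H @ K @ H) * (H @ Kp @ H)) / m**2

# Support of the toy distribution D (uniform over these points).
pts = np.array([0.0, 1.0, 3.0])
diff2 = (pts[:, None] - pts[None, :]) ** 2
G = np.exp(-0.5 * diff2)      # Gram matrix of K on the support
Gp = np.exp(-2.0 * diff2)     # Gram matrix of K' on the support
n = len(pts)

# For a uniform D on n points, centering by expectations equals centering the
# Gram matrix on the support, and E over i.i.d. x, x' averages all n^2 pairs.
population = centered_inner(G, Gp)

def expected_empirical(m):
    """Exact E_S[<K_c, K'_c>_F / m^2] by enumerating all samples of size m."""
    total = 0.0
    for idx in itertools.product(range(n), repeat=m):
        sel = np.ix_(idx, idx)
        total += centered_inner(G[sel], Gp[sel])
    return total / n**m

biases = {}
for m in range(2, 6):
    biases[m] = abs(population - expected_empirical(m))
    print(m, biases[m], 18.0 / m)   # bias next to the bound 18 R^2 R'^2 / m
```

The enumeration is exponential in $m$ and is only meant for tiny examples; it shows that the empirical quantity is biased while staying within the bound of the lemma.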
Appendix B. Stability bounds for alignment maximization algorithm

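Lemma 20 Let $\boldsymbol{\mu} = \mathbf{v} / \|\mathbf{v}\|$ and $\boldsymbol{\mu}' = \mathbf{v}' / \|\mathbf{v}'\|$. Then, the following identity holds for $\Delta\boldsymbol{\mu} = \boldsymbol{\mu}' - \boldsymbol{\mu}$:

$$\Delta\boldsymbol{\mu} = \frac{\Delta\mathbf{v}}{\|\mathbf{v}'\|} - \frac{(\Delta\mathbf{v})^\top (\mathbf{v} + \mathbf{v}') \, \mathbf{v}}{\|\mathbf{v}\| \, \|\mathbf{v}'\| \, (\|\mathbf{v}\| + \|\mathbf{v}'\|)}.$$

Proof By definition of $\Delta\boldsymbol{\mu}$, we can write

$$\Delta\boldsymbol{\mu} = \Delta\!\left( \frac{\mathbf{v}}{\|\mathbf{v}\|} \right) = \left[ \frac{\mathbf{v}'}{\|\mathbf{v}'\|} - \frac{\mathbf{v}}{\|\mathbf{v}'\|} \right] + \left[ \frac{\mathbf{v}}{\|\mathbf{v}'\|} - \frac{\mathbf{v}}{\|\mathbf{v}\|} \right] = \frac{\Delta\mathbf{v}}{\|\mathbf{v}'\|} - \frac{\mathbf{v} \, \Delta(\|\mathbf{v}\|)}{\|\mathbf{v}\| \, \|\mathbf{v}'\|}. \tag{20}$$

Observe that:

$$\Delta(\|\mathbf{v}\|) = \frac{\Delta(\|\mathbf{v}\|^2)}{\|\mathbf{v}\| + \|\mathbf{v}'\|} = \frac{\Delta\big( \sum_{i=1}^{p} v_i^2 \big)}{\|\mathbf{v}\| + \|\mathbf{v}'\|} = \frac{\sum_{i=1}^{p} \Delta(v_i)(v_i + v'_i)}{\|\mathbf{v}\| + \|\mathbf{v}'\|} = \frac{(\Delta\mathbf{v})^\top (\mathbf{v} + \mathbf{v}')}{\|\mathbf{v}\| + \|\mathbf{v}'\|}.$$

Plugging in this expression in (20) yields the statement of the lemma. ■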
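
The identity of Lemma 20 is easy to confirm numerically. The following sketch uses arbitrary random vectors (hypothetical values, not taken from the experiments) and compares the direct difference of the normalized vectors with the right-hand side of the lemma.

```python
import numpy as np

rng = np.random.default_rng(0)
v = rng.uniform(0.1, 1.0, size=5)              # plays the role of v
v_prime = v + rng.normal(scale=0.05, size=5)   # plays the role of v'

nv, nvp = np.linalg.norm(v), np.linalg.norm(v_prime)
dv = v_prime - v                               # Delta v = v' - v

# Left-hand side: Delta mu = v'/||v'|| - v/||v||.
lhs = v_prime / nvp - v / nv

# Right-hand side of the lemma.
rhs = dv / nvp - (dv @ (v + v_prime)) * v / (nv * nvp * (nv + nvp))

print(np.max(np.abs(lhs - rhs)))
```

The printed value is the largest coordinate-wise discrepancy between the two sides, which should be at the level of floating-point rounding.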

Consider the minimization (7) shown by Proposition 9 to provide the solution of the alignment
maximization problem for a convex combination. The matrix $\mathbf{M}$ and the vector $\mathbf{a}$ are functions of the training
sample $S$. To emphasize this dependency, we rewrite that optimization for a sample $S$ as

$$\min_{\mathbf{v} \ge 0} F(S, \mathbf{v}), \tag{21}$$

where $F(S, \mathbf{v}) = \mathbf{v}^\top \mathbf{M} \mathbf{v} - 2 \mathbf{v}^\top \mathbf{a} = \|\mathbf{v}\|_{\mathbf{M}}^2 - 2 \mathbf{v}^\top \mathbf{a}$. The following proposition provides a stability result
for this optimization problem.

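Proposition 21 Let $S$ and $S'$ denote two samples of size $m$ differing by only one point. Let $\mathbf{v}$ and
$\mathbf{v}'$ be the solutions of (21), respectively, for samples $S$ and $S'$. Then, the following inequality holds
for $\Delta\mathbf{v} = \mathbf{v}' - \mathbf{v}$:

$$\|\Delta\mathbf{v}\|_{\mathbf{M}}^2 \le \big[ \Delta\mathbf{a} - (\Delta\mathbf{M}) \mathbf{v}' \big]^\top \Delta\mathbf{v}. \tag{22}$$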

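Proof Since $C = \{\mathbf{v} : \mathbf{v} \ge 0\}$ is convex, for any $s \in [0, 1]$, $\mathbf{v} + s\Delta\mathbf{v}$ and $\mathbf{v}' - s\Delta\mathbf{v}$ are in $C$. Thus,
by definition of $\mathbf{v}'$ and $\mathbf{v}$,

$$F(S, \mathbf{v}) \le F(S, \mathbf{v} + s\Delta\mathbf{v}) \quad \text{and} \quad F(S', \mathbf{v}') \le F(S', \mathbf{v}' - s\Delta\mathbf{v}). \tag{23}$$

Summing up these inequalities, we obtain

$$
\begin{aligned}
\|\mathbf{v}\|_{\mathbf{M}}^2 - \|\mathbf{v} + s\Delta\mathbf{v}\|_{\mathbf{M}}^2 + \|\mathbf{v}'\|_{\mathbf{M}'}^2 - \|\mathbf{v}' - s\Delta\mathbf{v}\|_{\mathbf{M}'}^2
&\le 2 \mathbf{v}^\top \mathbf{a} - 2 (\mathbf{v} + s\Delta\mathbf{v})^\top \mathbf{a} + 2 \mathbf{v}'^\top \mathbf{a}' - 2 (\mathbf{v}' - s\Delta\mathbf{v})^\top \mathbf{a}' \\
&= 2 \big[ s \mathbf{a}'^\top \Delta\mathbf{v} - s \mathbf{a}^\top \Delta\mathbf{v} \big] = 2 s (\Delta\mathbf{a})^\top \Delta\mathbf{v}.
\end{aligned}
\tag{24}
$$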
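The left-hand side of this inequality can be rewritten as follows after expansion and using the identity
$\|\mathbf{v}' - s\Delta\mathbf{v}\|_{\mathbf{M}'}^2 - \|\mathbf{v}' - s\Delta\mathbf{v}\|_{\mathbf{M}}^2 = \|\mathbf{v}' - s\Delta\mathbf{v}\|_{\Delta\mathbf{M}}^2$:

$$
\begin{aligned}
& - \|s\Delta\mathbf{v}\|_{\mathbf{M}}^2 - 2 s \mathbf{v}^\top \mathbf{M} \Delta\mathbf{v} + \|\mathbf{v}'\|_{\mathbf{M}'}^2 - \|\mathbf{v}'\|_{\mathbf{M}}^2 - \|s\Delta\mathbf{v}\|_{\mathbf{M}}^2 + 2 s \mathbf{v}'^\top \mathbf{M} (\Delta\mathbf{v}) - \|\mathbf{v}' - s\Delta\mathbf{v}\|_{\Delta\mathbf{M}}^2 \\
&\quad = 2 s (1 - s) \|\Delta\mathbf{v}\|_{\mathbf{M}}^2 + \|\mathbf{v}'\|_{\Delta\mathbf{M}}^2 - \|\mathbf{v}' - s\Delta\mathbf{v}\|_{\Delta\mathbf{M}}^2.
\end{aligned}
$$

Then, expanding $\|\mathbf{v}' - s\Delta\mathbf{v}\|_{\Delta\mathbf{M}}^2$ results in the final inequality

$$2 s (1 - s) \|\Delta\mathbf{v}\|_{\mathbf{M}}^2 - s^2 \|\Delta\mathbf{v}\|_{\Delta\mathbf{M}}^2 + 2 s \mathbf{v}'^\top (\Delta\mathbf{M}) (\Delta\mathbf{v}) \le 2 s (\Delta\mathbf{a})^\top \Delta\mathbf{v}. \tag{25}$$

Dividing by $2s$ and taking the limit $s \to 0$ yields

$$\|\Delta\mathbf{v}\|_{\mathbf{M}}^2 + \mathbf{v}'^\top (\Delta\mathbf{M}) (\Delta\mathbf{v}) \le (\Delta\mathbf{a})^\top \Delta\mathbf{v}, \tag{26}$$

which concludes the proof of the proposition. ■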
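
The stability inequality can be illustrated on a small synthetic problem. The sketch below is not the authors' implementation and rests on assumptions not restated in this appendix: $\mathbf{M}$ is taken to have entries $\langle \mathbf{K}_{k,c}, \mathbf{K}_{l,c} \rangle_F$ and $\mathbf{a}$ entries $\langle \mathbf{K}_{k,c}, \mathbf{y}\mathbf{y}^\top \rangle_F$, with trace-normalized rank-1 base kernels. The solver `solve_nonneg_qp` is a hypothetical helper that solves (21) exactly by enumerating supports, which is only practical for a handful of base kernels and assumes $\mathbf{M}$ is positive definite. The inequality itself follows from the optimality of $\mathbf{v}$ and $\mathbf{v}'$ alone, so it holds for any such $\mathbf{M}$, $\mathbf{M}'$, $\mathbf{a}$, $\mathbf{a}'$.

```python
import itertools
import numpy as np

def build_M_a(X, y):
    """Assumed construction: M_kl = <K_kc, K_lc>_F and a_k = <K_kc, yy^T>_F."""
    m, p = X.shape
    H = np.eye(m) - np.ones((m, m)) / m
    Kc = [H @ (np.outer(X[:, k], X[:, k]) / (X[:, k] @ X[:, k])) @ H
          for k in range(p)]
    M = np.array([[np.sum(A * B) for B in Kc] for A in Kc])
    a = np.array([y @ A @ y for A in Kc])
    return M, a

def solve_nonneg_qp(M, a):
    """Exact minimizer of v^T M v - 2 v^T a over v >= 0 (M positive definite).

    The optimum is stationary on its support, so it is among the candidates
    obtained by solving M_SS v_S = a_S for every support S (small p only).
    """
    p = len(a)
    best_v, best_val = np.zeros(p), 0.0          # v = 0 is feasible with F = 0
    for r in range(1, p + 1):
        for S in itertools.combinations(range(p), r):
            S = list(S)
            vS = np.linalg.solve(M[np.ix_(S, S)], a[S])
            if np.all(vS >= 0):
                v = np.zeros(p)
                v[S] = vS
                val = v @ M @ v - 2 * v @ a
                if val < best_val:
                    best_v, best_val = v, val
    return best_v

rng = np.random.default_rng(0)
m, p = 30, 3
X = rng.normal(size=(m, p))
y = X[:, 0] - 0.5 * X[:, 1] + 0.1 * rng.normal(size=m)
X2, y2 = X.copy(), y.copy()
X2[-1] = rng.normal(size=p)      # S' differs from S only in the m-th point
y2[-1] = rng.normal()

M, a = build_M_a(X, y)
M2, a2 = build_M_a(X2, y2)
v, v2 = solve_nonneg_qp(M, a), solve_nonneg_qp(M2, a2)
dv, dM, da = v2 - v, M2 - M, a2 - a

lhs = dv @ M @ dv + v2 @ dM @ dv   # ||dv||_M^2 + v'^T (dM) dv
rhs = da @ dv                      # (da)^T dv
print(lhs, rhs)
```

When no non-negativity constraint is active at either solution, the two sides coincide up to rounding; the inequality is strict only when the constraints bind.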